COMPLEXITIES OF CONVEX COMBINATIONS AND BOUNDING
THE GENERALIZATION ERROR IN CLASSIFICATION 

By Vladimir Koltchinskii and Dmitry Panchenko

University of New Mexico and Massachusetts Institute of Technology 

We introduce and study several measures of complexity of func- 
tions from the convex hull of a given base class. These complexity 
measures take into account the sparsity of the weights of a con- 
vex combination as well as certain clustering properties of the base 
functions involved in it. We prove new upper confidence bounds on 
the generalization error of ensemble (voting) classification algorithms 
that utilize the new complexity measures along with the empirical dis- 
tributions of classification margins, providing a better explanation of 
generalization performance of large margin classification methods. 

1. Introduction. Since the invention of ensemble classification methods 
(such as boosting), the convex hull $\mathrm{conv}(\mathcal{H})$ of a given base function class $\mathcal{H}$
has become an important object of study in the machine learning literature. 
The reason is that the ensemble algorithms typically output classifiers that 
are convex combinations of simple classifiers selected by the algorithm from 
the base class $\mathcal{H}$, and, because of this, measuring the complexity of the whole
convex hull as well as of its subsets becomes very important in analysis of 
the generalization error of ensemble classifiers. Another important feature of 
boosting and many other ensemble methods is that they belong to the class 
of so-called large margin methods, that is, they are based on optimization 
of the empirical risk with respect to various loss functions that penalize not 
only for a misclassification (a negative classification margin), but also for a 
correct classification with too small positive margin. Thus, the very nature of 
these methods is to produce classifiers that tend to have rather large positive 
classification margins on the training data. Finding such classifiers becomes 
possible since the algorithms search for them in rather huge function classes
(such as convex hulls of typical VC-classes used in classification). 

This paper continues the line of research started by Schapire, Freund, 
Bartlett and Lee in [28] and further pursued in [2, 16, 19, 21, 26]. In these 
papers, the authors were trying to develop bounds on the generalization 
error of combined classifiers selected from the convex hull $\mathrm{conv}(\mathcal{H})$ in terms
of the empirical distributions of their margins, as well as certain measures 
of complexity of the whole convex hull or its subsets to which the classifiers 
belong. Our main goal here is to suggest new margin type bounds that are 
based to a greater extent on complexity measures of individual classifiers 
from the convex hull. These bounds are more adaptive and more flexible 
than the previously known bounds (but they are also harder to prove). 
They take into account various properties of the convex combinations that 
are related to their generalization performance as classifiers, such as the 
sparsity of the weights and clustering properties of base functions. 

The following notation and definitions will be used throughout the paper. 
Let $\mathcal{X}$ be a measurable space (space of instances) and let
$\mathcal{Y} = \{-1,+1\}$ be the set of labels. Let $\mathbb{P}$ be a probability
measure on $\mathcal{X}\times\mathcal{Y}$ that describes the underlying distribution
of instances and their labels. We do not assume that the label $y$ is a
deterministic function of $x$; in general, it can also be random, which means
that the conditional probability $\mathbb{P}(y = 1|x)$ may be different from 0 or 1.
Let $\mathcal{H}$ be a class of measurable functions $h : \mathcal{X} \to [-1,1]$.
Denote by $\mathcal{P}(\mathcal{H})$ the set of all discrete distributions on
$\mathcal{H}$ and let $\mathcal{F}$ be the convex hull of $\mathcal{H}$,

$$\mathcal{F} = \mathrm{conv}(\mathcal{H}) := \Bigl\{\int h(\cdot)\,\lambda(dh) : \lambda\in\mathcal{P}(\mathcal{H})\Bigr\}.$$

For $f\in\mathcal{F}$ we assume that $\mathrm{sign}(f(x))$ is used to classify
$x\in\mathcal{X}$ [$\mathrm{sign}(f(x)) = 0$ meaning that no decision is made].
Functions $f\in\mathcal{F}$ are sometimes called voting classifiers, since for a
convex combination $f = \sum_j \lambda_j h_j$ the weight (coefficient) $\lambda_j$
can be interpreted as the voting power of an individual classifier $h_j$ (they
are also called ensemble classifiers). The generalization error of any
classifier $f\in\mathcal{F}$ is defined as

(1.1)  $$\mathbb{P}(\mathrm{sign}(f(x)) \ne y) = \mathbb{P}(yf(x)\le 0).$$

Given an i.i.d. sample $(X_1,Y_1),\dots,(X_n,Y_n)$ from the distribution
$\mathbb{P}$, let $\mathbb{P}_n$ denote its empirical distribution. For a
measurable function $g$ on $\mathcal{X}\times\mathcal{Y}$, denote

$$\mathbb{P}g = \int g(x,y)\,d\mathbb{P}(x,y), \qquad \mathbb{P}_n g = n^{-1}\sum_{i=1}^{n} g(X_i,Y_i).$$

Whenever it is needed, we use the same notation $\mathbb{P}g$, $\mathbb{P}_n g$ or
$\mathbb{P}(A)$, $\mathbb{P}_n(A)$ for functions $g$ that depend only on $x$ and for
sets $A\subset\mathcal{X}$ (the meaning of the notation in this case is obvious).
The probability measure on the main sample space (on which all the random
variables including the training examples are defined) will be denoted by Pr
(not to be confused with $\mathbb{P}$).

In the paper we study the generalization error (1.1) of classifiers from the 
convex hull of a class $\mathcal{H}$ which is typically assumed to be "small," a condition
that is described precisely in terms of some complexity assumptions on $\mathcal{H}$
[see (2.2)]. A number of popular classification algorithms output classifiers
of this type. Below we briefly discuss two of them: AdaBoost, which is the
best-known classification algorithm of boosting type, and also bagging.
We provide some heuristic explanations of why these algorithms might have 
a tendency to output convex combinations of classifiers from the base class 
with a certain degree of sparsity of their weights and clustering of the base 
classifiers. 

AdaBoost. The algorithm starts by assigning equal weights $w_j^{(1)} = 1/n$ to
all the training examples $(X_j,Y_j)$. At iteration number $k$, $k = 1,\dots,T$, the
algorithm attempts to minimize the weighted training error with weights
$w_j^{(k)}$ over the base class $\mathcal{H}$ of functions $h : \mathcal{X}\to\{-1,1\}$
(such that $h\in\mathcal{H}$ implies $-h\in\mathcal{H}$). If $e_k$ denotes the
weighted training error of the approximate solution $h_k$ of this minimization
problem, the algorithm computes the coefficient

$$\alpha_k := \frac{1}{2}\log\frac{1-e_k}{e_k},$$

which is nonnegative since $e_k\le 1/2$, and then updates the weights according
to the formula

$$w_j^{(k+1)} := \frac{w_j^{(k)}\,e^{-Y_j\alpha_k h_k(X_j)}}{Z},$$

where $Z$ is a normalizing constant that makes the weights add up to 1. After
$T$ iterations, the algorithm outputs the classifier $f = \sum_{k=1}^{T}\lambda_k h_k$, where

$$\lambda_k := \frac{\alpha_k}{\sum_{j=1}^{T}\alpha_j}.$$



Typically, the class $\mathcal{H}$ is relatively small so that it is easy to design an
efficient algorithm (often called a weak learner) of approximate minimization
of the weighted training error over the class. The result of this, however, is
that at many iterations the weak learner outputs classifiers $h_k$ from the base
class $\mathcal{H}$ whose weighted training error is just a little smaller than 1/2.
If this is the case at iteration $k$, the coefficient $\alpha_k$ is close to 0 and
the weights $w_j^{(k+1)}$ do not differ much from the weights $w_j^{(k)}$. If the
weak learner possesses some stability, this means that the base classifier
$h_{k+1}$ is close to the base classifier $h_k$. As a result, when the algorithm
proceeds one observes a slow drift of the classifiers $h_k$ in the "hypotheses
space" $\mathcal{H}$, and the coefficients of these classifiers in the convex
combination will be small until we reach a place in $\mathcal{H}$ where the
stability of the weak learner breaks down and it outputs a classifier with a
weighted training error significantly smaller than 1/2. Thus, one can expect a
certain degree of sparsity (many small coefficients) and of clustering (many
base classifiers that are close to one another) of the resulting convex
combination.
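
The AdaBoost steps described above can be sketched in a few lines of code. This is only a minimal illustration under stated assumptions: the sample is one-dimensional, the base class is a finite list of threshold classifiers closed under negation, and the weak learner minimizes the weighted error exactly. The names `stump`, `weighted_error`, `adaboost`, `vote` and the toy data are hypothetical and do not come from the paper.

```python
import math

def stump(threshold, sign):
    """Base classifier: h(x) = sign if x > threshold, else -sign (values in {-1, +1})."""
    return lambda x: sign if x > threshold else -sign

def weighted_error(h, w, xs, ys):
    """Weighted training error of h under the current weights w."""
    return sum(wj for wj, x, y in zip(w, xs, ys) if h(x) != y)

def adaboost(xs, ys, candidates, rounds):
    """Return (lambdas, chosen): normalized voting weights and selected base classifiers."""
    n = len(xs)
    w = [1.0 / n] * n                       # initial weights w_j^(1) = 1/n
    alphas, chosen = [], []
    for _ in range(rounds):
        # Weak learner (assumption of this sketch): exact minimization over a finite class.
        h = min(candidates, key=lambda g: weighted_error(g, w, xs, ys))
        e = weighted_error(h, w, xs, ys)
        # Clamp to avoid log(0) and rounding slightly above 1/2 (sketch-only safeguard).
        e = min(max(e, 1e-12), 0.5)
        alpha = 0.5 * math.log((1.0 - e) / e)
        # Multiplicative update followed by normalization by Z.
        w = [wj * math.exp(-y * alpha * h(x)) for wj, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wj / z for wj in w]
        alphas.append(alpha)
        chosen.append(h)
    total = sum(alphas)
    if total == 0.0:                        # degenerate case: every alpha_k is zero
        lambdas = [1.0 / rounds] * rounds
    else:
        lambdas = [a / total for a in alphas]
    return lambdas, chosen

def vote(lambdas, chosen, x):
    """Value f(x) of the convex combination f = sum_k lambda_k h_k."""
    return sum(l * h(x) for l, h in zip(lambdas, chosen))

# Toy data that no single threshold classifier fits exactly.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
candidates = [stump(t + 0.5, s) for t in range(7) for s in (1, -1)]
lambdas, chosen = adaboost(xs, ys, candidates, rounds=5)
```

Inspecting `lambdas` after a larger number of rounds is one way to look for the sparsity discussed above, although nothing in this sketch guarantees it.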

Bagging [9]. The algorithm at each iteration produces a bootstrap sample
drawn from the training data and outputs a classifier that minimizes
the corresponding bootstrap training error over the base class $\mathcal{H}$. After $T$
iterations the algorithm averages the resulting $T$ base classifiers, creating a
convex combination with equal weights $\lambda_j := 1/T$. Again, if the weak learner
possesses some stability and since each bootstrap sample is a "small
perturbation" of the training data, one can expect some degree of clustering of
the base functions involved in the convex combination. (In this case, it is
impossible to talk about the sparsity of the coefficients since all of them are
equal.)
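
A similarly hedged sketch of bagging: it assumes a finite base class of threshold classifiers and exact minimization of the bootstrap training error. The helper names and the toy sample are hypothetical. Returning the indices of the selected base classifiers makes repeated selections (a crude form of clustering) easy to see.

```python
import random

def stump(threshold, sign):
    """Base classifier: h(x) = sign if x > threshold, else -sign."""
    return lambda x: sign if x > threshold else -sign

def bagging(sample, candidates, rounds, seed=0):
    """Return indices (into `candidates`) of the classifiers picked on each bootstrap sample."""
    rng = random.Random(seed)
    n = len(sample)
    picked = []
    for _ in range(rounds):
        # Bootstrap sample: n draws with replacement from the training data.
        boot = [sample[rng.randrange(n)] for _ in range(n)]
        # Exact minimization of the bootstrap training error over the finite class.
        errors = [sum(1 for x, y in boot if h(x) != y) for h in candidates]
        picked.append(errors.index(min(errors)))
    return picked

def vote(picked, candidates, x):
    """Equal-weight convex combination f(x) = (1/T) * sum_k h_k(x)."""
    return sum(candidates[i](x) for i in picked) / len(picked)

sample = [(1, 1), (2, 1), (3, -1), (4, -1), (5, 1), (6, 1)]
candidates = [stump(t + 0.5, s) for t in range(7) for s in (1, -1)]
picked = bagging(sample, candidates, rounds=20, seed=0)
distinct = len(set(picked))   # number of distinct base classifiers in the ensemble
```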

These explanations are of course rather heuristic in nature and somewhat 
vague. The reality might be much more complicated since, for instance, 
weak learners are not necessarily stable. Often, lack of stability of the weak 
learner is viewed as an advantage since it allows the algorithm to create more 
"diverse" ensembles of base classifiers and to produce a combined classifier 
with larger margins. However, the bounds of this paper seem to suggest that 
the performance of combined classifiers is related to a rather delicate trade- 
off between their complexity and margin properties. So, stability of the weak 
learner is a good and a bad property at the same time (one should rather 
talk about optimal stability). The phenomenon of sparsity of the coefficients
is much better understood in the case of support vector machines (see [30] 
for recent results in this direction) and the development of these ideas for 
ensemble methods remains an open problem that is beyond the scope of 
our paper. However, regardless of how close this explanation is to the truth, 
some degree of sparsity and clustering in convex combinations output by 
popular learning algorithms can be observed in experiments (see some very 
preliminary results in [20] and more results in [1]). Our intention here is 
not to study why this is happening, but rather to understand what kind of 
influence sparsity and clustering properties of convex combinations output 
by AdaBoost and other classification algorithms have on their generalization 
performance. 

Another motivation to study the complexities based on sparsity and clus- 
tering comes from learning theory, where it has become common to use 
global or localized complexities based on sup-norm or continuity modulus of 
empirical or Rademacher processes involved in the problem and indexed by
the class $\mathcal{F}$ in order to bound the generalization error (see [5, 8, 17, 18, 23]).
However, these complexities do not necessarily measure the accuracy of modern
classification methods correctly. The reason is that they are based on
deviations of the empirical measure $\mathbb{P}_n$ from the true distribution
$\mathbb{P}$ uniformly over the whole class $\mathcal{F}$ or over
$L_2(\mathbb{P})$-balls in the class, while the
learning algorithms might have some intrinsic ways to restrict complexities
of the classifiers they output by searching for a minimum of empirical risk in
some parts of the class $\mathcal{F}$ with restricted complexity (although this part is
typically data-dependent, cannot be specified in advance and has to be de- 
termined in a rather complicated model selection process). Thus, there is a 
need to develop new more adaptive bounds that take into account complex- 
ities of individual classifiers in the class and can be applied to the classifiers 
output by learning algorithms. A possible general approach to such
complexities can be described as follows. Suppose $\{\mathcal{G}\}$ is a family of
subclasses of the class $\mathcal{F}$ and let $c_n(\mathcal{G})$ be a complexity
measure associated with the class $\mathcal{G}$ (e.g., it can be based on a
localized Rademacher complexity of $\mathcal{G}$). Suppose also it has been
observed that a learning algorithm tends to output classifiers from subclasses
$\mathcal{G}$ with small values of complexity $c_n(\mathcal{G})$ ("sparse
subclasses"). Then a natural question to ask is whether a quantity of the
type $c_n(f) := \inf\{c_n(\mathcal{G}) : \mathcal{G}\ni f\}$ (which is already an
individual complexity of $f$) provides bounds on the generalization error of
$f$. In the case where $\{\mathcal{G}\}$ is a countable family of nested
subclasses, such questions are related to structural risk minimization and
other model selection techniques. However, in classification one often
encounters more complicated situations, such as the setting of Theorem 5
below, where a natural family $\{\mathcal{G}\}$ is neither
countable nor nested and consists of distribution-dependent classes indexed
by a functional parameter (see the definition of the classes Tq p N before 
Lemma 2). The study of complexity measures that occur in such more com- 
plicated model selection frameworks is our main subject here. In the next 
section we will try to develop several new approaches to measuring com- 
plexities of convex combinations and use these complexities in new bounds 
on generalization error in classification. 

2. Main results. The first important result about the generalization error
of classifiers from $\mathcal{F} = \mathrm{conv}(\mathcal{H})$ was proved in [28],
where the generalization ability of voting classifiers is explained in terms of
the empirical distribution $\mathbb{P}_n(yf(x)\le\delta)$ of the quantity $yf(x)$
called the margin. The authors prove that if
$\mathcal{H} = \{2I(x\in C) - 1 : C\in\mathcal{C}\}$, where $\mathcal{C}$ is a
Vapnik-Chervonenkis class of sets with VC-dimension $V$ (for definitions see,
e.g., [32] or [12]), then for all $t>0$ with probability at least $1-e^{-t}$ for
all $f\in\mathcal{F} = \mathrm{conv}(\mathcal{H})$ we have

(2.1)  $$\mathbb{P}(yf(x)\le 0) \le \inf_{\delta\in(0,1]}\Bigl(\mathbb{P}_n(yf(x)\le\delta) + K\Bigl(\Bigl(\frac{V}{n\delta^2}\log^2\frac{n}{\delta}\Bigr)^{1/2} + \Bigl(\frac{t}{n}\Bigr)^{1/2}\Bigr)\Bigr),$$

where $K>0$ is an absolute constant. To understand this result, let us give
one interpretation of the margin $yf(x)$. One can think of $yf(x)$ as the
"confidence" of prediction of the example $x$, since $f$ classifies $x$ correctly
if and only if $yf(x)>0$; and if $f(x)$ is large in absolute value it means that
it makes its prediction with high confidence. If $f$ classifies most of the
training examples with high confidence, then for some $\delta>0$ (which is not
"too small") the proportion of examples $\mathbb{P}_n(yf(x)\le\delta)$ classified
below the confidence $\delta$ will be small. The second term of the bound is of
the order $(\sqrt{n}\,\delta)^{-1}$, and
will also be small for large $n$, which makes the bound meaningful.
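
To make the empirical margin distribution concrete, the following minimal sketch computes the proportion of training examples with margin at most $\delta$. The function names and the toy classifier `f` (a clipped identity map, standing in for a voting classifier with values in $[-1,1]$) are assumptions of this illustration only.

```python
def margins(f, sample):
    """Margins y * f(x) of the classifier f on a labeled sample [(x, y), ...]."""
    return [y * f(x) for x, y in sample]

def empirical_margin_cdf(f, sample, delta):
    """P_n(y f(x) <= delta): fraction of examples whose margin does not exceed delta."""
    m = margins(f, sample)
    return sum(1 for v in m if v <= delta) / len(m)

# Hypothetical real-valued classifier with values in [-1, 1].
def f(x):
    return max(-1.0, min(1.0, x))

sample = [(-0.8, -1), (-0.1, -1), (0.05, 1), (0.6, 1), (0.3, -1)]
error_rate = empirical_margin_cdf(f, sample, 0.0)    # training error (margin <= 0)
low_margin = empirical_margin_cdf(f, sample, 0.2)    # examples classified with margin <= 0.2
```

The gap between `error_rate` and `low_margin` is what the margin bounds trade against the complexity term: a larger $\delta$ shrinks the second term but counts more examples in the first.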

This result was extended by Schapire and Singer in [29] to classes of
real-valued functions, namely, to so-called VC-subgraph classes (for the
definition see [32]), and was further extended in several directions in [19]
and [21]. The main idea of this follow-up work was to replace the second term
of the bound proved by Schapire et al. [28] by a function
$\varepsilon_n(\mathcal{F};\delta;t)$ that has better dependence on the sample size
$n$ and on the margin parameter $\delta$. The bounds obtained in [19] are also
more general: they apply to arbitrary function classes $\mathcal{F}$, not only
to convex hulls.

Given a probability distribution $Q$ on $\mathcal{X}$ and a class $\mathcal{H}$ of
measurable functions on $\mathcal{X}$, denote

$$d_{Q,2}(f,g) := (Q(f-g)^2)^{1/2}, \qquad f,g\in\mathcal{H},$$

the $L_2(Q)$-distance in $\mathcal{H}$. Let the covering number
$N_{d_{Q,2}}(\mathcal{H},u)$ be the minimal number of $d_{Q,2}$-balls of radius
$u>0$ with centers in $\mathcal{H}$ needed to cover $\mathcal{H}$. The logarithm of
this number, $H_{d_{Q,2}}(\mathcal{H},u) := \log N_{d_{Q,2}}(\mathcal{H},u)$, is called
the $u$-entropy of $\mathcal{H}$ with respect to $d_{Q,2}$. In what follows, we will
also use $L_p(Q)$-distances and the corresponding covering numbers and
entropies for $p\in[1,+\infty]$.
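
For a finite family of functions and $Q$ equal to the empirical measure on a finite set of points, these quantities can be explored numerically. The sketch below computes the $L_2(Q)$-distance and builds a cover greedily. The greedy construction only gives an upper bound on the covering number, not the minimal cover, and all names and data are hypothetical.

```python
import math

def l2_distance(f, g, points):
    """L2(Q)-distance between f and g, with Q the uniform measure on `points`."""
    return math.sqrt(sum((f(x) - g(x)) ** 2 for x in points) / len(points))

def greedy_cover(functions, points, u):
    """Pick centers from `functions` so that each function is within distance u of a center.

    The number of centers is an upper bound on the covering number at scale u.
    """
    centers = []
    for f in functions:
        # f becomes a new center only if no existing center already covers it.
        if all(l2_distance(f, c, points) > u for c in centers):
            centers.append(f)
    return centers

def stump(threshold):
    """Threshold classifier with values in {-1, +1}."""
    return lambda x: 1 if x > threshold else -1

points = list(range(1, 11))                      # support of the discrete measure Q
family = [stump(i + 0.5) for i in range(11)]     # 11 threshold classifiers
centers = greedy_cover(family, points, u=0.7)
entropy_upper_bound = math.log(len(centers))     # upper bound on the u-entropy
```

Two thresholds that disagree on $k$ of the ten points are at distance $(4k/10)^{1/2}$, so at scale $u = 0.7$ only neighbouring thresholds cover each other.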

Often, it makes sense to assume (and it will be assumed in what follows)
that the family of weak classifiers $\mathcal{H}$ satisfies the condition

(2.2)  $$\sup_{Q\in\mathcal{P}(\mathcal{X})} N_{d_{Q,2}}(\mathcal{H},u) = O(u^{-V})$$

for some $V>0$, where $\mathcal{P}(\mathcal{X})$ is the set of all discrete
distributions on $\mathcal{X}$. For example, if $\mathcal{H}$ is a VC-subgraph class
with VC-dimension $V(\mathcal{H})$, then by the well-known result that goes back
to Dudley and Pollard (see [14] for the current version), (2.2) holds with
$V = 2V(\mathcal{H})$, namely,

(2.3)  $$\sup_{Q\in\mathcal{P}(\mathcal{X})} N_{d_{Q,2}}(\mathcal{H},u) \le e\,(V(\mathcal{H})+1)\Bigl(\frac{2e}{u^2}\Bigr)^{V(\mathcal{H})}.$$

Under the condition (2.2), the bound (2.1) was slightly improved by
Koltchinskii and Panchenko in [19], who proved that for all $t>0$ with
probability at least $1-e^{-t}$ for all $f\in\mathcal{F} = \mathrm{conv}(\mathcal{H})$
we have

$$\mathbb{P}(yf(x)\le 0) \le \inf_{\delta\in(0,1]}\Bigl(\mathbb{P}_n(yf(x)\le\delta) + K\Bigl(\Bigl(\frac{V}{n\delta^2}\Bigr)^{1/2} + \Bigl(\frac{t}{n}\Bigr)^{1/2}\Bigr)\Bigr),$$

thus getting rid of the logarithmic factor $\log^2(n/\delta)$ in the second term of
(2.1). By itself this improvement is insignificant, but the generality of the
methods developed in [19] allowed the authors to obtain this type of bound
for general classes $\mathcal{F}$ of classifiers (not necessarily the convex hulls
of VC-classes) and to make some significant improvements in other situations,
for example, for neural networks. (The first margin type bounds for general
function classes, including neural networks, were based on $L_\infty$-entropies
and shattering dimensions of the class; see [4].) Moreover, it was shown
in [19] that (2.1) can be further improved in the so-called zero-error case,
when $\mathbb{P}_n(yf(x)\le\delta)$ is small for $\delta\to 0$. Namely, the following
result holds. Assume that $\mathcal{H}$ satisfies (2.2) and let $\alpha = 2V/(V+2)$.
Then, for all $t>0$ with probability at least $1-e^{-t}$ for all
$f\in\mathcal{F}$ we have (with some numerical constant $K>0$)

(2.4)  $$\mathbb{P}(yf(x)\le 0) \le K\inf_{\delta\in(0,1]}\Bigl(\mathbb{P}_n(yf(x)\le\delta) + \Bigl(\frac{1}{\delta}\Bigr)^{2\alpha/(2+\alpha)} n^{-2/(\alpha+2)} + \frac{t}{n}\Bigr).$$

This bound will be meaningful if

$$\delta^* = \sup\bigl\{\delta : \delta^{2\alpha/(2+\alpha)}\,\mathbb{P}_n(yf(x)\le\delta) \le n^{-2/(2+\alpha)}\bigr\}$$

is not "too small," which means that $\mathbb{P}_n(yf(x)\le\delta)$ should decrease
"fast enough" when $\delta\to 0$. Actually, this bound holds not only for classes
of functions $\mathcal{F} = \mathrm{conv}(\mathcal{H})$ where $\mathcal{H}$ satisfies
(2.2), but for any class $\mathcal{F}$ such that

(2.5)  $$\sup_{Q\in\mathcal{P}(\mathcal{X})}\log N_{d_{Q,2}}(\mathcal{F},u) = O(u^{-\alpha}), \qquad \alpha\in(0,2),$$

or even when the uniform entropy in (2.5) is replaced by the entropy with
respect to the empirical $L_2$-distance $d_{\mathbb{P}_n,2}$. It is well known that
the convex hull $\mathcal{F} = \mathrm{conv}(\mathcal{H})$ of the class $\mathcal{H}$
satisfying (2.2) satisfies (2.5) with $\alpha = 2V/(V+2)$ (see, e.g., [32]), which
explains the particular choice of $\alpha$ in (2.4). Under the condition (2.5) on
$\mathcal{F}$ the bound of (2.4) is optimal, as shown in [19] by constructing a
special class of functions $\mathcal{F}$ in the Banach space $\ell_\infty$ of
uniformly bounded sequences. Finally, note that the constant $K$ involved in
the bound can be redistributed between the two terms: in front of the term
$\mathbb{P}_n(yf(x)\le\delta)$ one can put a constant arbitrarily close to 1 at the
price of making the constant in front of the second term large.

In [21] Koltchinskii, Panchenko and Lozano proved bounds on the
generalization error under a more general assumption on the entropy of the
class $\mathcal{F}$:

(2.6)  $$\int_0^x H^{1/2}_{d_{\mathbb{P}_n,2}}(\mathcal{F};u)\,du \le D\psi(x), \qquad x>0,$$

with some constant $D>0$ and with a concave function $\psi$. They showed that
in this case the term

$$\Bigl(\frac{1}{\delta}\Bigr)^{2\alpha/(2+\alpha)} n^{-2/(\alpha+2)}$$

involved in the bound (2.4) should be replaced by the quantity
$\varepsilon^{\psi}(\delta)$ defined as the largest solution of the equation

$$\varepsilon = \frac{1}{\delta\sqrt{n}}\,\psi(\delta\sqrt{\varepsilon}),$$

leading to so-called $\psi$-bounds on generalization error.
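
As a consistency check between the entropy condition (2.5) and the term appearing in (2.4), one can work out the power-law case by hand. The derivation below is a sketch under two assumptions that are not spelled out in the text at this point: that the entropy integral is taken of the square root of the entropy, so that entropy of order $u^{-\alpha}$ gives $\psi(x)$ proportional to $x^{1-\alpha/2}$, and that the defining equation has the form $\varepsilon = \psi(\delta\sqrt{\varepsilon})/(\delta\sqrt{n})$. Constants are ignored.

```latex
% Assumption: \psi(x) = x^{1-\alpha/2}, 0 < \alpha < 2 (constants dropped).
\begin{align*}
\varepsilon
  &= \frac{(\delta\sqrt{\varepsilon})^{1-\alpha/2}}{\delta\sqrt{n}}
   = \delta^{-\alpha/2}\,\varepsilon^{1/2-\alpha/4}\,n^{-1/2}, \\
\varepsilon^{(2+\alpha)/4}
  &= \delta^{-\alpha/2}\,n^{-1/2}, \\
\varepsilon
  &= \Bigl(\frac{1}{\delta}\Bigr)^{2\alpha/(2+\alpha)} n^{-2/(2+\alpha)} .
\end{align*}
```

Under these assumptions the solution coincides with the second term of (2.4), which is the sense in which the $\psi$-bounds generalize it.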

Margin-type bounds on generalization error can also be expressed in terms
of other entropies, in particular, the $L_\infty$-entropy, and in terms of the
shattering dimension of the class, as in the papers of Bartlett [4] (that
preceded [28]) and of Antos, Kegl, Linder and Lugosi [2]. A typical bound in
terms of $L_\infty$-entropy is of the form

(2.7)  $$\mathbb{P}(yf(x)\le 0) \le K\inf_{\delta\in(0,1]}\Bigl[\mathbb{P}_n(yf(x)\le\delta) + \frac{\log\mathbb{E}N_{d_{\mathbb{P}_n,\infty}}(\mathcal{F},\delta/2) + t}{n}\Bigr]$$

for all $f\in\mathcal{F}$ with probability at least $1-e^{-t}$. The
$L_\infty$-entropy is always larger than the $L_2$-entropy, but for special
classes of functions the difference might not be very significant, and because
of its different form the $L_\infty$-bound sometimes has an advantage over the
$L_2$-bounds. However, a detailed comparison of these bounds goes beyond the
scope of this paper.

Numerous experiments with AdaBoost and some other classification algorithms
showed that in practice the bounds of type (2.4) hold with smaller
values of $\alpha$ than the theoretical considerations (based on the estimates of
the entropy of the whole convex hull) suggest. This means that ensemble 
classifiers often belong to a subset of the convex hull of a smaller entropy 
than the entropy of the whole convex hull. A natural question is whether 
it is possible to incorporate in the bound on generalization error the infor- 
mation about the individual complexity of the actual classifier rather than 
use the global complexity of the whole convex hull. In other words, is it
possible to replace the function $\psi$ from condition (2.6) by a data-dependent and
classifier-dependent function that would make the $\psi$-bounds on
generalization error more adaptive?

The fact that the margin type bounds hold in such generality means, at 
least on the intuitive level, that the explicit structure of the convex hull 
is not used there. On the contrary, in this paper we will heavily utilize 
the structure of the convex hull and prove new bounds that reflect some 
measures of complexity of convex combinations. 

The idea of using a certain measure of complexity of individual convex
combinations already appeared in [21], where the authors suggested a way to
use the rate of decay of the weights $\lambda_j$ in the convex combination
$f = \sum_{j=1}^{T}\lambda_j h_j$ to improve the bound on the generalization error
of $f$. This measure, called approximate $\gamma$-dimension, is defined as follows.
Let us assume that the weights are arranged in decreasing order,
$|\lambda_1|\ge|\lambda_2|\ge\cdots$. For a number $\gamma\in[0,1]$, the approximate
$\gamma$-dimension of $f$ is defined as the smallest integer $d\ge 0$ such that
there exist $T\ge 1$, functions $h_j\in\mathcal{H}$, $j = 1,\dots,T$, and numbers
$\lambda_j\in\mathbb{R}$, $j = 1,\dots,T$, satisfying the conditions
$f = \sum_{j=1}^{T}\lambda_j h_j$, $\sum_{j=1}^{T}|\lambda_j|\le 1$ and
$\sum_{j=d+1}^{T}|\lambda_j|\le\gamma$. Note that in [21] the authors dealt with
the symmetric convex hull, so the coefficients $\lambda_j$ are not necessarily
positive. The $\gamma$-dimension of $f$ will be denoted by $d(f;\gamma)$.

Then, for all $t>0$ with probability at least $1-e^{-t}$ we have for all
$f\in\mathcal{F} = \mathrm{conv}(\mathcal{H})$ (again with $\alpha = 2V/(V+2)$)

(2.8)  $$\mathbb{P}(yf(x)\le 0) \le K\inf_{\delta\in(0,1]}\Bigl(\mathbb{P}_n(yf(x)\le\delta) + \inf_{\gamma\in[0,1]}\Bigl(\frac{d(f;\gamma)}{n}\log\frac{n}{\delta} + \Bigl(\frac{\gamma}{\delta}\Bigr)^{2\alpha/(2+\alpha)} n^{-2/(\alpha+2)}\Bigr) + \frac{t}{n}\Bigr).$$

This is an improvement over (2.4), which can be seen by comparing the
infimum over $\gamma$ of the expression in the bound with the value of the
expression for $\gamma = 1$ and noting that $d(f;1) = 0$. For example, if the
weights decrease polynomially, $|\lambda_j|\sim j^{-\beta}$, $\beta>1$, or
exponentially, $|\lambda_j|\sim e^{-\beta j}$, $\beta>0$,
then explicit minimization over $\gamma$ shows that in these cases (2.8) can be a
substantial improvement over (2.4) (see examples in [21]).

Our first result in this paper also deals with bounding the generalization
error of a classifier
$f = \sum_{j=1}^{T}\lambda_j h_j\in\mathcal{F} = \mathrm{conv}(\mathcal{H})$ in terms
of complexity measures taking into account the sparsity of the weights
$\lambda_j$. Theorem 1 below is a new version of the results of [21]
[specifically, of the bound (2.8)] that can be interpreted as interpolation
between zero-error and nonzero-error cases; as its corollary we will give a new
short proof of (2.8). Theorem 2 is another result in this direction with a
different dependence of the bound on the sample size and the margin parameter
$\delta$.

Let $\Phi = \{\varphi_\delta : \mathbb{R}\to[0,1] : \delta\in\Delta\subset\mathbb{R}_+\}$
be a countable family of Lipschitz functions such that the Lipschitz norm of
$\varphi_\delta$ is bounded by $\delta^{-1}$, that is,

$$|\varphi_\delta(s_1) - \varphi_\delta(s_2)| \le \delta^{-1}|s_1 - s_2|,$$

and $\sum_{\delta\in\Delta}\delta < \infty$. In applications, such functions are
frequently used as loss functions in empirical risk minimization procedures of
boosting type that output large margin classifiers. One can use a specific
choice of $\Delta = \{2^{-k} : k\ge 1\}$. The following theorem holds.

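Theorem 1. If (2.2) holds, then for all $t>0$ with probability at least
$1-e^{-t}$ for all $f\in\mathcal{F} = \mathrm{conv}(\mathcal{H})$ and
$\delta\in\Delta = \{2^{-k} : k\ge 1\}$,

$$\frac{\mathbb{P}\varphi_\delta(yf(x)) - \mathbb{P}_n\varphi_\delta(yf(x))}{(\mathbb{P}\varphi_\delta(yf(x)))^{1/2}} \le K\inf_{0\le\gamma\le 1}\Bigl(\Bigl(\frac{d(f;\gamma)}{n}\log\frac{n}{\delta}\Bigr)^{1/2} + \Bigl(\frac{\gamma}{\delta}\Bigr)^{\alpha/2}\frac{(\mathbb{P}\varphi_\delta(yf(x)))^{-\alpha/4}}{n^{1/2}} + \Bigl(\frac{t}{n}\Bigr)^{1/2}\Bigr),$$

where $\alpha = 2V/(V+2)$.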

Let us take, for example, $\varphi_\delta$ such that $\varphi_\delta(s) = 1$ for
$s\le 0$, $\varphi_\delta(s) = 0$ for $s\ge\delta$ and $\varphi_\delta$ is linear for
$0\le s\le\delta$. For any probability measure $Q$ (e.g., $Q = \mathbb{P}$ or
$\mathbb{P}_n$), one can write

(2.9)  $$Q(yf(x)\le 0) \le Q\varphi_\delta(yf(x)) \le Q(yf(x)\le\delta).$$

For this choice of $\varphi_\delta$ and for a fixed $f$ let us denote
$a = \mathbb{P}\varphi_\delta(yf(x))$ and $b = \mathbb{P}_n\varphi_\delta(yf(x))$. It is
clear that after minimizing the expression involved in the right-hand side over
$\gamma$, the inequality of Theorem 1 can be written as

$$a \le b + u\,a^{1/2} + v\,a^{1/2-\alpha/4},$$

where $u$ and $v$ are constants depending on the parameters involved in the
inequality. Since the right-hand side of the last inequality is strictly concave
with respect to $a$, this inequality can be uniquely solved for $a$ or, in other
words, it can be equivalently written as $a\le p(b)$ for a unique positive
function $p$, which is, obviously, increasing in $b$. Combining this with (2.9) we
get

$$\mathbb{P}(yf(x)\le 0) \le \mathbb{P}\varphi_\delta(yf(x)) \le p(\mathbb{P}_n\varphi_\delta(yf(x))) \le p(\mathbb{P}_n(yf(x)\le\delta)).$$

The analysis of $p$ will readily imply the main result in [21].
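
Both steps of this argument are easy to check numerically. The sketch below first verifies the sandwich (2.9) for the piecewise linear $\varphi_\delta$ on a list of margins, and then solves an inequality of the form $a\le b + u\,a^{1/2} + v\,a^{1/2-\alpha/4}$ for the largest admissible $a$ by bisection. That form of the inequality, the function names and the sample margins are assumptions of this illustration.

```python
def ramp(s, delta):
    """phi_delta(s): 1 for s <= 0, 0 for s >= delta, linear in between (Lipschitz 1/delta)."""
    if s <= 0.0:
        return 1.0
    if s >= delta:
        return 0.0
    return 1.0 - s / delta

def sandwich(margins, delta):
    """Return (Q(m <= 0), Q phi_delta(m), Q(m <= delta)) for the empirical measure Q."""
    n = len(margins)
    lower = sum(1 for m in margins if m <= 0.0) / n
    middle = sum(ramp(m, delta) for m in margins) / n
    upper = sum(1 for m in margins if m <= delta) / n
    return lower, middle, upper

def solve_for_a(b, u, v, alpha, tol=1e-12):
    """Largest a >= 0 with a <= b + u*a**0.5 + v*a**(0.5 - alpha/4), for 0 < alpha < 2."""
    def slack(a):
        return b + u * a ** 0.5 + v * a ** (0.5 - alpha / 4.0) - a
    hi = 1.0
    while slack(hi) >= 0.0:          # the slack is concave in a and eventually negative
        hi *= 2.0
    lo = 0.0                         # slack is nonnegative at 0 (its limit there is b >= 0)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slack(mid) >= 0.0:
            lo = mid
        else:
            hi = mid
    return lo

margin_values = [-0.3, 0.05, 0.1, 0.6, 0.8]
lower, middle, upper = sandwich(margin_values, 0.25)
p_small = solve_for_a(b=0.1, u=0.2, v=0.0, alpha=1.0)
p_large = solve_for_a(b=0.2, u=0.2, v=0.0, alpha=1.0)
```

With $v = 0$ the inequality is quadratic in $a^{1/2}$, which gives a closed form to compare the bisection against; the solution also increases with $b$, as the text asserts for $p$.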

Corollary 1. If (2.2) holds and $\alpha = 2V/(V+2)$, then for any $t>0$
with probability at least $1-e^{-t}$ (2.8) holds for all
$f\in\mathcal{F} = \mathrm{conv}(\mathcal{H})$.

Roughly speaking, Corollary 1 describes the zero-error case of Theorem 1.
Thus, Theorem 1 is a more general and flexible formulation of the main result
in [21], as it interpolates between zero- and nonzero-error cases.

Next we will present a new bound on the generalization error of voting
classifiers that takes into account the sparsity of weights in the convex
combination. Given $\lambda\in\mathcal{P}(\mathcal{H})$ and
$f(x) = \int h(x)\,\lambda(dh)$, we can also represent $f$ as
$f = \sum_{k=1}^{T}\lambda_k h_k$ with $T\le\infty$ (since $\lambda$ is a discrete
probability measure). Without loss of generality let us assume that
$\lambda_1\ge\lambda_2\ge\cdots$. We define
$\gamma_d(f) = \sum_{k=d+1}^{T}\lambda_k$ and for $\delta>0$ we define the effective
dimension function by

(2.10)  $$e_n(f,\delta) = \min_{0\le d\le T}\Bigl(d + \Bigl(\frac{\gamma_d(f)}{\delta}\Bigr)^2\log n\Bigr).$$

This name is motivated by the fact that (as will become clear from the proof
of Theorem 2 below) it can be interpreted as a dimension of a subset of the
convex hull $\mathrm{conv}(\mathcal{H})$ that contains a "good" approximation of $f$.
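
The effective dimension is straightforward to compute from the weights of a finite convex combination. The following sketch assumes the form $\min_{0\le d\le T}\bigl(d + (\gamma_d(f)/\delta)^2\log n\bigr)$, where $\gamma_d(f)$ is the sum of all but the $d$ largest weights. The function name and the two example weight vectors are hypothetical.

```python
import math

def effective_dimension(weights, delta, n):
    """min over d of d + (gamma_d / delta)**2 * log(n), gamma_d = sum of all but the d largest weights."""
    lam = sorted(weights, reverse=True)
    T = len(lam)
    # tails[d] = lam[d] + ... + lam[T-1], i.e. the weight left after removing the d largest.
    tails = [0.0] * (T + 1)
    for d in range(T - 1, -1, -1):
        tails[d] = tails[d + 1] + lam[d]
    return min(d + (tails[d] / delta) ** 2 * math.log(n) for d in range(T + 1))

n, delta = 1000, 0.1

# Equal weights (as in bagging): no sparsity to exploit.
uniform_weights = [1.0 / 30] * 30
e_uniform = effective_dimension(uniform_weights, delta, n)

# Geometrically decaying weights, normalized to sum to 1.
raw = [2.0 ** (-j) for j in range(1, 31)]
geometric_weights = [r / sum(raw) for r in raw]
e_geometric = effective_dimension(geometric_weights, delta, n)

# Taking d = 0 always gives the bound log(n) / delta**2.
trivial_bound = math.log(n) / delta ** 2
```

In this toy comparison the geometrically decaying weights give a much smaller value than the equal weights, and both are far below the trivial $d = 0$ bound.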

Theorem 2 (Sparsity bound). If (2.2) holds, then there exists an
absolute constant $K>0$ such that for all $t>0$ with probability at least
$1-e^{-t}$ for all $\lambda\in\mathcal{P}(\mathcal{H})$ and
$f(x) = \int h(x)\,\lambda(dh)$,

$$\mathbb{P}(yf(x)\le 0) \le \inf_{\delta\in(0,1]}\Bigl(U^{1/2} + \bigl(\mathbb{P}_n(yf(x)\le\delta) + U\bigr)^{1/2}\Bigr)^2,$$

where

$$U = K\Bigl(\frac{V e_n(f,\delta)}{n}\log\frac{n}{\delta} + \frac{t}{n}\Bigr).$$

It follows from the bound of the theorem that for all $\varepsilon>0$

$$\mathbb{P}(yf(x)\le 0) \le \inf_{\delta\in(0,1]}\Bigl((1+\varepsilon)\,\mathbb{P}_n(yf(x)\le\delta) + \Bigl(2 + \varepsilon + \frac{1}{\varepsilon}\Bigr)U\Bigr),$$

which is a more explicit version of the result. Results of similar flavor can
be, in principle, also obtained as a consequence of entropy-based margin-type
bounds, in particular, using the $L_\infty$-entropy. However, we believe that
the more direct probabilistic argument we use in our proof (that goes back 
to [28]) is very natural in this problem. Moreover, the same argument is 
typically present in the derivation of entropy bounds for the convex hull or 
its subsets needed in alternative proofs. Taking this into account, the di- 
rect proof we give here is shorter and easier. This becomes especially clear 
in Theorems 3 and 4, where the entropy bounds on subsets of the convex 
hull with restrictions on the variance of convex combinations (see the def- 
initions below) are most likely unknown. It is also worth mentioning that 
the same randomization idea combined with a couple of other techniques 
can be used in some other situations where probabilistic interpretation is
not straightforward, for instance, for kernel machines and their hierarchies 
(see [1]). 

The following result was proved in [11]. Let $\mathcal{H}$ be a finite class with
$N = \mathrm{card}(\mathcal{H})$ and let $\delta^*$ be the minimal margin on the
training examples, that is,

$$\delta^* = \delta^*(f) = \min_{i\le n} Y_i f(X_i) = \sup\{\delta : \mathbb{P}_n(yf(x)\le\delta) = 0\}.$$

Then for any $t>0$ with probability at least $1-e^{-t}$ we have, for all
$f\in\mathcal{F} = \mathrm{conv}(\mathcal{H})$ such that
$\delta^*(f)\ge(32/N)^{1/2}$,

(2.H) nyfix) < 0) < K f]^ + i\ 

We notice that

$$e_n(f,\delta) = \min_{0\le d\le T}\Bigl(d + \Bigl(\frac{\gamma_d(f)}{\delta}\Bigr)^2\log n\Bigr) \le \frac{1}{\delta^2}\log n,$$

where the last inequality follows by taking $d = 0$ in the expression under
the minimum. This shows that as a corollary of Theorem 2 one can extend
the result of Breiman [11] to much more general classes of functions [the
role of $\log N$ in (2.11) being now played by $V\log n$]. Moreover, the bound of
Theorem 2 interpolates between zero-error and nonzero-error cases without
any assumptions on the empirical distribution of the margin
$\mathbb{P}_n(yf(x)\le\delta)$. To illustrate the role of the effective dimension
$e_n(f,\delta)$ let us suppose that the weights $\lambda_j$ decrease polynomially
or exponentially fast:

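Example. (a) If $\lambda_j\sim j^{-\beta}$ for $\beta>1$, then one can explicitly
minimize the expression in (2.10), which in the zero-error case
$\mathbb{P}_n(yf(x)<\delta^*) = 0$ gives

$$\mathbb{P}(yf(x)\le 0) \le K\Bigl(\frac{V}{n}\Bigl(\frac{\log n}{(\delta^*)^2}\Bigr)^{1/(2\beta-1)}\log\frac{n}{\delta^*} + \frac{t}{n}\Bigr),$$

which can be a significant improvement for large values of $\beta$.

(b) If $\lambda_j\sim e^{-j}$, then again one can explicitly minimize the
expression in (2.10), which in the zero-error case
$\mathbb{P}_n(yf(x)<\delta^*) = 0$ gives

$$\mathbb{P}(yf(x)\le 0) \le K\Bigl(\frac{V}{n}\log^2\frac{n}{\delta^*} + \frac{t}{n}\Bigr).$$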

It is quite clear that one can come up with many alternative definitions 
of sparsity measures of convex combinations that are based only on the 
sizes of coefficients. For instance, one can measure the size of the "tail" of 
the convex combination (after the d largest coefficients have been removed) 
using a different norm instead of the $\ell_1$-norm we used above. However, our
approach seems to be reasonable since it is based on the idea of splitting the
whole convex combination into two parts, one of them being $d$-dimensional
and another one belonging to a rescaled convex hull of $\mathcal{H}$ (the whole convex
hull times a small coefficient, which is a natural "neighborhood" of 0 in the
convex hull).

The major drawback of this type of bound, however, is that it takes into 
account only the size of the coefficients of the convex combination, but not 
the "closeness" of the base functions involved in it. Such a "closeness" (re- 
flected, e.g., in the fact that the base functions classify most of the examples 
the same way or, more generally, can be divided into several groups with 
the functions within each group classifying similarly) could possibly lead to 
further complexity reduction. 

We suggest below two different approaches to this problem. The first 
approach is based on interpreting the convex combination as a mean of a 
function $h$ randomly drawn from the class $\mathcal{H}$ with some probability
distribution $\lambda$. Then in order to measure the complexity of the convex combination
it becomes natural to bring in probabilistic quantities such as the variance
of the convex combination introduced below. In the extreme case, when all
classifiers $h_j$ are equal, $f$ belongs to the simple class $\mathcal{H}$ itself rather than to
the possibly very large class $\mathcal{F}$; in this case, the variance is equal to 0 and
this is reflected in our generalization analysis of $f$. This approach is clearly
related to the randomization proof of margin type bounds in [28], but its 
real roots are in the well-known work of B. Maurey (see [27]) that provided 
a probabilistic argument often used in bounding the entropy of the convex 
hull. The approach might be also of interest to practitioners since variance 
can be easily incorporated in risk minimization techniques as a complexity 
penalty. The generalization bounds based on the notion of variance are given 
in Theorems 3 and 4. 

The second approach does not rely on the probabilistic interpretation, 
but rather exploits the nonuniqueness of representing functions by convex 
combinations and is based on covering numbers of the set of base functions
in "optimal" representations of $f$. Thus, the metric structure of the base
class replaces in this approach the probabilistic structure. The generalization 
bound based on this approach is given in Theorem 5. 

Despite the fact that, possibly, there might be many other ways to define 
complexities of this type, we believe that the approaches we are using have 
very natural connections to important mathematical structures involved in 
the problem. 

Given $\lambda\in\mathcal{P}(\mathcal{H})$, consider

$$f(x) = \int h(x)\,\lambda(dh) = \sum_{k=1}^{T}\lambda_k h_k(x).$$

We ask the following question: what if the functions $h_1,\dots,h_T$ are, in some
sense, close to each other? For example,
$n^{-1}\sum_{k=1}^{n}(h_i(X_k) - h_j(X_k))^2$ is small for all pairs $(i,j)$. In this
case, the convex combination can be approximated "well" by only one function
from $\mathcal{H}$. Or, more generally, one can imagine the situation when there are
several clusters of functions among $h_1,\dots,h_T$ such that within each cluster
all functions are close to each other. This information should be reflected in
the generalization error of the classifier $f$, since it can be approximated by a
classifier from a small subset of $\mathcal{F}$. Below we prove two results in this
direction. We will start by describing the result where we consider
$h_1,\dots,h_T$ as one (hopefully "small") cluster, and then we will naturally
generalize it to any number of clusters.

We define a pointwise variance of h with respect to the distribution A by 

(2.12)  $$\sigma_\lambda^2(x) = \int \Bigl(h(x) - \int h(x)\,\lambda(dh)\Bigr)^2 \lambda(dh).$$

Clearly, $\sigma_\lambda^2(x) = 0$ if and only if

$$h(x) = \int h(x)\,\lambda(dh), \quad \lambda\text{-a.e. on } \mathcal{H},$$

or, equivalently (in the case of a discrete measure $\lambda$), if $h_1(x) = h_2(x)$ for
all $h_1, h_2 \in \mathcal{H}$ with $\lambda(\{h_1\}) > 0$, $\lambda(\{h_2\}) > 0$. Complexity characteristics
of a similar flavor are sometimes used in the current work on PAC-Bayesian
bounds on generalization performance of aggregated estimates for
least squares regression; see [3].
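
As an illustration of definition (2.12) for a discrete measure, the following minimal sketch computes the pointwise variance at a single point. The decision stumps, thresholds and weights are hypothetical choices made only for this example; they are not taken from the text.

```python
from typing import Callable, Sequence

Classifier = Callable[[float], float]


def stump(threshold: float) -> Classifier:
    """Hypothetical base classifier: +1 at or above the threshold, -1 below it."""
    return lambda x: 1.0 if x >= threshold else -1.0


def pointwise_variance(hs: Sequence[Classifier],
                       weights: Sequence[float],
                       x: float) -> float:
    """Variance of h(x) when h is drawn from hs with probabilities `weights`."""
    mean = sum(w * h(x) for h, w in zip(hs, weights))  # this is f(x)
    return sum(w * (h(x) - mean) ** 2 for h, w in zip(hs, weights))


hs = [stump(0.3), stump(0.7)]
weights = [0.5, 0.5]
agree = pointwise_variance(hs, weights, 0.9)     # both stumps vote +1
disagree = pointwise_variance(hs, weights, 0.5)  # the stumps vote +1 and -1
```

The variance vanishes exactly at the points where all classifiers carrying positive weight agree, which is the equivalence stated above.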

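Theorem 3. If (2.2) holds, then there exists an absolute constant $K > 0$
such that for all $t > 0$ with probability at least $1 - e^{-t}$ for all $\lambda \in \mathcal{P}(\mathcal{H})$ and
$f(x) = f_\lambda(x) = \int h(x)\,\lambda(dh)$,

$$P(yf_\lambda(x) \le 0) \le K \inf_{0<\delta\le\gamma\le 1}\Bigl(P_n(yf_\lambda(x) \le \delta) + P_n(\sigma_\lambda^2(x) \ge \gamma) + \frac{V\gamma}{n\delta^2}\log^2\frac{n}{\delta} + \frac{t}{n}\Bigr).$$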

Remark. The following simple observation might be useful. Since

$$P_n(\sigma_\lambda^2(x) \ge \gamma) \le \frac{P_n\sigma_\lambda^2}{\gamma},$$

one can plug this into the right-hand side of the bound of the theorem and
then optimize it with respect to $\gamma$. The optimal value of $\gamma$ is

$$\gamma = \frac{n^{1/2}\,\delta\,(P_n\sigma_\lambda^2)^{1/2}}{V^{1/2}\log(n/\delta)} \wedge 1$$

(we are assuming here $P_n\sigma_\lambda^2 > 0$!), which immediately leads to the following
upper bound on generalization error:

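$$K \inf_{0<\delta\le\gamma}\Bigl(P_n(yf_\lambda(x) \le \delta) + 2\sqrt{\frac{V P_n\sigma_\lambda^2}{n\delta^2}}\,\log\frac{n}{\delta} \wedge \frac{V}{n\delta^2}\log^2\frac{n}{\delta} + \frac{t}{n}\Bigr).$$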

This is to be compared with the bound (2.1) and it shows that the quantity
$P_n\sigma_\lambda^2$ might provide an interesting choice of complexity penalty in classification
problems of this type. More generally, for $p \ge 1$ and (again, under the
assumption $P_n\sigma_\lambda^{2p} > 0$)

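$$\gamma = \frac{(P_n\sigma_\lambda^{2p})^{1/(p+1)}\,n^{1/(p+1)}\,\delta^{2/(p+1)}}{V^{1/(p+1)}\log^{2/(p+1)}(n/\delta)} \wedge 1$$

we are getting the bound

$$K \inf_{0<\delta\le\gamma}\Bigl(P_n(yf_\lambda(x) \le \delta) + \frac{V^{p/(p+1)}(P_n\sigma_\lambda^{2p})^{1/(p+1)}}{n^{p/(p+1)}\delta^{2p/(p+1)}}\log^{2p/(p+1)}\frac{n}{\delta} \wedge \frac{V}{n\delta^2}\log^2\frac{n}{\delta} + \frac{t}{n}\Bigr).$$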

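In the limit $p \to \infty$ this yields the bound [provided that $\max_{1\le j\le n}\sigma_\lambda^2(X_j) > 0$]

$$K \inf_{0<\delta\le\max_{1\le j\le n}\sigma_\lambda^2(X_j)}\Bigl(P_n(yf_\lambda(x) \le \delta) + \frac{V\max_{1\le j\le n}\sigma_\lambda^2(X_j)}{n\delta^2}\log^2\frac{n}{\delta} + \frac{t}{n}\Bigr)$$

[which should be compared with (2.11); note the presence of the variance in
the numerator].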
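
The optimization over $\gamma$ described in the remark can be checked numerically. The sketch below ignores the side constraints on $\gamma$ and simply balances the Markov term $P_n\sigma_\lambda^2/\gamma$ against the complexity term $V\gamma\log^2(n/\delta)/(n\delta^2)$; the numerical values of $n$, $\delta$, $V$ and $P_n\sigma_\lambda^2$ are arbitrary assumptions chosen for illustration.

```python
import math


def variance_penalty(gamma: float, pn_sigma2: float, n: int, delta: float, V: float) -> float:
    """Markov bound on P_n(sigma^2 >= gamma) plus the complexity term of the theorem."""
    markov_term = pn_sigma2 / gamma
    complexity_term = V * gamma / (n * delta ** 2) * math.log(n / delta) ** 2
    return markov_term + complexity_term


def optimal_gamma(pn_sigma2: float, n: int, delta: float, V: float) -> float:
    """Unconstrained minimizer of variance_penalty with respect to gamma."""
    return math.sqrt(n) * delta * math.sqrt(pn_sigma2) / (math.sqrt(V) * math.log(n / delta))


# Illustrative (assumed) values, not taken from the paper.
n, delta, V, pn_sigma2 = 10_000, 0.1, 2.0, 0.05
gamma_star = optimal_gamma(pn_sigma2, n, delta, V)
best = variance_penalty(gamma_star, pn_sigma2, n, delta, V)
closed_form = 2.0 * math.sqrt(V * pn_sigma2 / (n * delta ** 2)) * math.log(n / delta)
```

At the minimizer the two terms are equal, so the penalty collapses to twice their geometric mean, which is where the square root of $P_n\sigma_\lambda^2$ in the resulting bound comes from.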

The result of Theorem 3 is, probably, of limited interest since there is no 
reason to expect that the "global variances" of convex combinations output 
by popular learning algorithms are necessarily small. It is much more likely 
that it would be possible to split a convex combination into several clusters, 
each having a small variance. This is reflected in the following definition. 

Given $m \ge 1$ and $\lambda \in \mathcal{P}(\mathcal{H})$, define a set

$$\mathcal{C}^m(\lambda) = \Bigl\{(\alpha_1,\dots,\alpha_m,\lambda^1,\dots,\lambda^m) : \lambda^k \in \mathcal{P}(\mathcal{H}),\ \alpha_k \ge 0,\ \sum_{k=1}^{m}\alpha_k\lambda^k = \lambda\Bigr\}.$$

For an element $c \in \mathcal{C}^m(\lambda)$, we define a weighted variance over clusters by

(2.13)  $$\sigma^2(c;x) = \sum_{k=1}^{m}\alpha_k^2\,\sigma_{\lambda^k}^2(x),$$

where $\sigma_{\lambda^k}^2(x)$ are defined in (2.12). If indeed there are $m$ small clusters among
functions $h_1,\dots,h_T$, then one should be able to choose an element $c \in \mathcal{C}^m(\lambda)$
so that $\sigma^2(c;x)$ will be small on the majority of data points $X_1,\dots,X_n$.

Theorem 4. If (2.2) holds, then there exists an absolute constant $K > 0$
such that for all $t > 0$ with probability at least $1 - e^{-t}$ for all $\lambda \in \mathcal{P}(\mathcal{H})$ and
$f(x) = f_\lambda(x) = \int h(x)\,\lambda(dh)$,

$$P(yf_\lambda(x) \le 0) \le K \inf_{m\ge 1}\ \inf_{c\in\mathcal{C}^m(\lambda)}\ \inf_{0<\delta\le\gamma\le 1}\Bigl(P_n(yf_\lambda(x) \le \delta) + P_n(\sigma^2(c;x) \ge \gamma) + \frac{Vm\gamma}{n\delta^2}\log^2\frac{n}{\delta} + \frac{t}{n}\Bigr).$$
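
To make the weighted variance over clusters (2.13) concrete, here is a small sketch evaluated at one fixed point $x$. The four classifier values, the uniform weights and the two-cluster decomposition are hypothetical; the point is only that a convex combination with large global variance can have zero weighted variance once it is split into tight clusters.

```python
from typing import Sequence


def variance(values: Sequence[float], probs: Sequence[float]) -> float:
    """Variance of a discrete distribution on `values` with probabilities `probs`."""
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))


def weighted_cluster_variance(alphas, cluster_values, cluster_probs) -> float:
    """sigma^2(c; x) = sum_k alpha_k^2 * (variance of h(x) within cluster k)."""
    return sum(a ** 2 * variance(v, p)
               for a, v, p in zip(alphas, cluster_values, cluster_probs))


# Values h_1(x), ..., h_4(x) at a fixed x (assumed): two classifiers vote +1, two vote -1.
values = [1.0, 1.0, -1.0, -1.0]
lam = [0.25, 0.25, 0.25, 0.25]

# Decomposition c: cluster 1 = {h_1, h_2}, cluster 2 = {h_3, h_4}, each with alpha = 1/2.
alphas = [0.5, 0.5]
cluster_values = [[1.0, 1.0], [-1.0, -1.0]]
cluster_probs = [[0.5, 0.5], [0.5, 0.5]]

global_variance = variance(values, lam)
clustered_variance = weighted_cluster_variance(alphas, cluster_values, cluster_probs)
# The mixture sum_k alpha_k * lambda^k must reproduce lambda.
mixture = [a * p for a, probs in zip(alphas, cluster_probs) for p in probs]
```

Here the global variance equals 1 while the weighted variance over the two clusters is 0, which is the situation Theorem 4 is designed to exploit.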

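If we define the number of $(\gamma,\delta)$-clusters of $\lambda$ as the smallest $m$ for which
there exists $c \in \mathcal{C}^m(\lambda)$ such that

$$P_n(\sigma^2(c;x) \ge \gamma) \le \frac{Vm\gamma}{n\delta^2}\log^2\frac{n}{\delta}$$

and denote this number by $\hat m_\lambda(n,\gamma,\delta)$, then the bound implies that for all
$\lambda \in \mathcal{P}(\mathcal{H})$

$$P(yf_\lambda(x) \le 0) \le K \inf_{0<\delta\le\gamma\le 1}\Bigl(P_n(yf_\lambda(x) \le \delta) + \frac{V\hat m_\lambda(n,\gamma,\delta)\,\gamma}{n\delta^2}\log^2\frac{n}{\delta} + \frac{t}{n}\Bigr).$$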

The choice of $\gamma = \delta$ gives an upper bound with the error term (added to the
empirical margin distribution) of the order

$$\frac{V\hat m_\lambda(n,\delta,\delta)}{n\delta}\log^2\frac{n}{\delta},$$

which significantly improves earlier bounds provided that we are lucky to
have a small number of clusters $\hat m_\lambda(n,\delta,\delta)$ in the convex combination.

We now turn to a different approach to measuring complexity of convex
combinations. It is based on empirical covering numbers of the set of
functions involved in a particular convex combination. Let $\mathcal{H}$ be a class of
measurable functions (classifiers) from $\mathcal{X}$ into $\{-1,1\}$, such that $\mathcal{H}$ satisfies
(2.2). It is interesting to note that in this case the condition (2.2) is
equivalent to the condition that the class of sets $\mathcal{C} := \{\{h = +1\} : h \in \mathcal{H}\}$ is
Vapnik-Chervonenkis (see, e.g., [13]).

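As before, $\mathcal{H}$ will play the role of a base class. Let $\mathcal{F} := \mathrm{sconv}(\mathcal{H})$, that
is, $\mathcal{F}$ is the symmetric convex hull of $\mathcal{H}$,

$$\mathrm{sconv}(\mathcal{H}) := \Bigl\{\sum_{i=1}^{N}\lambda_i h_i : h_i \in \mathcal{H},\ \lambda_i \in \mathbb{R},\ \sum_{i=1}^{N}|\lambda_i| \le 1,\ N \ge 1\Bigr\}.$$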

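For $f \in \mathcal{F}$, a probability measure $Q$ on $\mathcal{X}$ and $p \in [1,+\infty]$, define

$$N_{d_{Q,p}}(f;\varepsilon) := \inf\{N_{d_{Q,p}}(\mathcal{H}';\varepsilon) : \mathcal{H}' \subset \mathcal{H},\ f \in \mathrm{sconv}(\mathcal{H}')\}.$$

Let us call a subset $\mathcal{H}' \subset \mathcal{H}$ a base of $f \in \mathrm{sconv}(\mathcal{H})$ iff $f \in \mathrm{sconv}(\mathcal{H}')$. Then
$N_{d_{Q,p}}(f;\varepsilon)$ is the minimal $\varepsilon$-covering number of bases of $f$. Let

$$\psi_n(f,\delta) := \int_0^{\delta}\sqrt{N_{d_{P_n,2}}(f;\varepsilon)\log(1/\varepsilon)}\,d\varepsilon.$$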

As earlier in this section (see also [21]), for a concave nondecreasing function
$\psi$ on $[0,+\infty)$ with $\psi(0) = 0$, we define $\varepsilon_n^{\psi}(\delta)$ as the largest solution of the
equation

with respect to $\varepsilon$. Let now

$$\varepsilon_n(f,\delta) := \varepsilon_n^{\psi_n(f,\cdot)}(\delta).$$

The function $\psi_n(f,\cdot)$ can be viewed as a data- and classifier-dependent
estimate of the entropy integral in the condition (2.6), and the bound of
Theorem 5 below is an adaptive version of $\psi$-bounds developed in [21].

Theorem 5. If a class of measurable functions $\mathcal{H} = \{h : \mathcal{X} \to \{-1,+1\}\}$
satisfies (2.2), then for all $t \ge C\log^2 n$, with probability at least $1 - e^{-t}$ the
following bound holds for all $f \in \mathcal{F}$:

$$P\{yf(x) \le 0\} \le K \inf_{\delta\in(0,1]}\Bigl[P_n\{yf(x) \le \delta\} + \varepsilon_n(f,\delta) + \frac{t}{n\delta^2}\Bigr],$$

where $K, C > 0$ are absolute constants.

Remark 1. Clearly, for all $\varepsilon > 0$

$$N_{d_{P_n,2}}(f;\varepsilon) \le N_{d_{P_n,\infty}}(f;\varepsilon),$$

and since the functions in $\mathcal{H}$ take their values in $\{-1,1\}$, $N_{d_{P_n,\infty}}(f;\varepsilon)$ does
not depend on $\varepsilon$ for all $\varepsilon < 2$. Therefore, in this range of $\varepsilon$ we will use the
notation $N_{d_{P_n,\infty}}(f)$ for it. This quantity is always bounded by $2^n$ and it
shows how many classifiers $h_j \in \mathcal{H}$ that differ on the sample are involved in
the "most economical" representation of $f \in \mathrm{sconv}(\mathcal{H})$ (so it can be viewed
as a dimension of $f$). The following bound is trivial:

$$\psi_n(f,\delta) \le 2\sqrt{N_{d_{P_n,\infty}}(f)}\;\delta\sqrt{\log\frac{1}{\delta}}, \quad \delta \le e^{-1},$$

and it shows, in particular, that $\psi_n(f,\delta)$ is well defined. It also shows that
the function $\varepsilon_n(f,\delta)$ involved in the bound of the theorem can be replaced
by the following upper bound that has a much simpler meaning:

$$\frac{N_{d_{P_n,\infty}}(f)}{n}\log\frac{n}{\delta^2 N_{d_{P_n,\infty}}(f)}$$

[although $\varepsilon_n(f,\delta)$ can be much smaller than this upper bound].
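
The count of "classifiers that differ on the sample" mentioned in Remark 1 is easy to compute for a given finite base. The sketch below does this for a hypothetical family of decision stumps and a hypothetical sample; note that it evaluates the count for one fixed base only, whereas the quantity in the remark is the minimum of such counts over all bases of $f$.

```python
from typing import Callable, Sequence

Classifier = Callable[[float], int]


def stump(threshold: float) -> Classifier:
    """Hypothetical {-1, +1}-valued base classifier."""
    return lambda x: 1 if x >= threshold else -1


def distinct_on_sample(base: Sequence[Classifier], sample: Sequence[float]) -> int:
    """Number of distinct sign patterns (h(X_1), ..., h(X_n)) produced by the base.

    For {-1, +1}-valued functions two classifiers are either identical on the
    sample or at sup-distance 2, so this is the covering number of the base in
    the empirical sup-metric for every radius smaller than 2.
    """
    return len({tuple(h(x) for x in sample) for h in base})


base = [stump(0.1), stump(0.2), stump(0.25), stump(0.9)]
sample = [0.15, 0.5, 0.95]
count = distinct_on_sample(base, sample)
```

In this example the stumps with thresholds 0.2 and 0.25 coincide on the sample, so four base classifiers contribute only three distinct patterns.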
Remark 2. In fact, the bound of the theorem can be improved by introducing

$$H_n(f;\varepsilon) := N_{d_{P_n,2}}(f;\varepsilon)\log\frac{1}{\varepsilon} \wedge \varepsilon^{-2V/(V+2)}$$

and defining the function



<5 



Mf,t,5):= I HV 2 [f,evJ- ) de. 




o 

Then one can define $\varepsilon_n(f,t,\delta)$ as $\varepsilon_n^{\psi}(\delta)$ with $\psi := \psi_n(f,t,\cdot)$. It follows from
the proofs below that for all $t \ge C\log^2 n$, with probability at least $1 - e^{-t}$
the following bound holds for all $f \in \mathcal{F}$:

$$P\{yf(x) \le 0\} \le K \inf_{\delta\in(0,1]}\Bigl[P_n\{yf(x) \le \delta\} + \varepsilon_n(f,t,\delta) + \frac{t}{n\delta^2}\Bigr]$$

with some constants $K, C > 0$. The term $\varepsilon^{-2V/(V+2)}$ in the definition of
$H_n(f;\varepsilon)$ is (up to a constant) a well-known upper bound on the entropy of
the convex hull of a VC-type class. The definition of $H_n(f;\varepsilon)$ is based on an
upper bound (see Lemma 2 below) on the entropy of the restricted convex
hull of $\mathcal{H}$ defined (given a probability measure $Q$ and $p \ge 1$) as

$$\{f \in \mathrm{sconv}(\mathcal{H}) : \forall\varepsilon : N_{d_{Q,p}}(f;\varepsilon) \le N(\varepsilon)\},$$

where $N$ is a given nonincreasing function. In fact, any other upper bound
on the entropy of such sets can be used instead of $H_n(f;\varepsilon)$. Apparently,
more subtle bounds than the result of Lemma 2 (that interpolate better
between the case of finite-dimensional convex combinations and the case of
the whole convex hull) should exist and allow one to improve the bound
of Theorem 5, but at the moment we do not know how to prove a better
bound. Theorem 5 can be extended to classes $\mathcal{H}$ of functions taking values in
$[-1,1]$ (not necessarily binary functions), but its formulation becomes more
complicated since it involves both $L_2(P_n)$- and $L_1(P_n)$-entropies in this case.

3. Proofs. Theorem 6 will be the main technical tool in the proofs of
Theorems 1-4. This theorem extends the inequality of Vapnik and Chervonenkis
for VC-classes of sets and VC-major classes of functions to classes of
functions $\mathcal{F} = \{f : \mathcal{X} \to [-1,1]\}$ satisfying the uniform entropy condition

(3.1)  $$\int_0^{\infty}\log^{1/2}N(\mathcal{F},u)\,du < \infty,$$

where

$$N(\mathcal{F},u) = \sup_{Q\in\mathcal{P}(\mathcal{X})}N_{d_{Q,2}}(\mathcal{F};u).$$

For instance, it obviously holds under (2.2) for $\mathcal{F} = \mathrm{conv}(\mathcal{H})$, as it follows
from the well-known bounds on the entropy of the convex hull.

Theorem 6. If $\mathcal{F} = \{f : \mathcal{X} \to [0,1]\}$ is a class of $[0,1]$-valued functions
that satisfies (3.1), then there exists an absolute constant $K > 0$ such that
for any $t > 0$ with probability at least $1 - e^{-t}$ for all $f \in \mathcal{F}$

(3.2)  $$Pf - P_nf \le K\Bigl(n^{-1/2}\int_0^{(Pf)^{1/2}}\log^{1/2}N(\mathcal{F},u)\,du + \Bigl(\frac{tPf}{n}\Bigr)^{1/2}\Bigr),$$
and with probability at least $1 - e^{-t}$ for all $f \in \mathcal{F}$

(3.3)  $$P_nf - Pf \le K\Bigl(n^{-1/2}\int_0^{(P_nf)^{1/2}}\log^{1/2}N(\mathcal{F},u)\,du + \Bigl(\frac{tP_nf}{n}\Bigr)^{1/2}\Bigr).$$

Proof. Equation (3.2) is Corollary 1 in [25]. Equation (3.3) is not for- 
mulated in [25] explicitly but it is proved similarly to (3.2). Equations (3.2) 
and (3.3) also follow easily from Corollary 3 in [26]. □ 

There are two features of this result that make it particularly useful. First
of all, it is well known (see [13]) that if, given $p > 0$, we look at the layer of
functions $\mathcal{F}_p = \{f \in \mathcal{F} : Pf \le p\}$, then the typical value of the deviation $Pf - P_nf$
on this layer or, in other words, the expectation $E\sup\{Pf - P_nf : f \in \mathcal{F}_p\}$,
can be estimated by the entropy integral

$$n^{-1/2}\int_0^{\sqrt{p}}\log^{1/2}N(\mathcal{F},u)\,du,$$

where the upper limit $\sqrt{p}$ measures the size of $\mathcal{F}_p$. This simply reflects the
fact that functions with smaller mean $Pf$ will have smaller fluctuations.
Theorem 6 says that this happens on all layers at the same time, which
gives us an adaptive control over the whole class $\mathcal{F}$. The second important
feature of this result is that the deviation from a typical value is controlled
for each function individually by the term $(tPf/n)^{1/2}$. This is convenient
from the point of view of structural risk minimization since one only has to
estimate the typical value on each class to which a function $f$ may belong,
but the deviation term is left unchanged. For other results in this direction
we refer the reader to [26].

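Given an integer $d \ge 1$, denote

$$\mathcal{F}_d := \Bigl\{\sum_{i=1}^{d}\lambda_i h_i : h_i \in \mathcal{H},\ \lambda_i \ge 0,\ \sum_{i=1}^{d}\lambda_i \le 1\Bigr\}.$$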

Again, let $\Phi = \{\varphi_\delta : \mathbb{R} \to [0,1] : \delta \in \Delta \subset \mathbb{R}_+\}$ be a countable family of Lipschitz
functions such that the Lipschitz norm of $\varphi_\delta$ is equal to $\delta^{-1}$ and $\sum_{\delta\in\Delta}\delta < \infty$.
One can use a specific choice of $\Delta = \{2^{-k} : k \ge 1\}$. For $a > 0$, $b \ge 0$ we
define

$$\phi(a,b) = \frac{(a-b)^2}{a}\,I(a \ge b),$$

and for $a = 0$ we let $\phi(a,b) = \phi(0,b) = 0$. The following theorem holds.

Theorem 7. If (2.2) holds, then there exists $K > 0$ such that for all
$t > 0$ with probability at least $1 - e^{-t}$ we have for all $d \ge 1$, $f \in \mathcal{F}_d$ and $\delta \in \Delta$,

(3.4)  $$\phi(P\varphi_\delta(yf(x)), P_n\varphi_\delta(yf(x))) \le K\Bigl(\frac{dV}{n}\log\frac{n}{\delta} + \frac{t}{n}\Bigr).$$
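
The function $\phi(a,b)$ entering (3.4) is used later through two properties, its convexity and its monotonicity in each argument. The following sketch, which is only a numerical sanity check on an arbitrarily chosen grid and not part of the argument, implements $\phi$ and tests midpoint convexity and monotonicity.

```python
import itertools


def phi(a: float, b: float) -> float:
    """phi(a, b) = (a - b)^2 / a when a >= b and a > 0, and 0 otherwise."""
    if a <= 0.0 or a < b:
        return 0.0
    return (a - b) ** 2 / a


def is_midpoint_convex(points, tol: float = 1e-12) -> bool:
    """Check phi(midpoint) <= average of phi at the endpoints for all pairs of points."""
    for (a1, b1), (a2, b2) in itertools.product(points, repeat=2):
        mid = phi((a1 + a2) / 2.0, (b1 + b2) / 2.0)
        if mid > (phi(a1, b1) + phi(a2, b2)) / 2.0 + tol:
            return False
    return True


# Grid on [0, 1] x [0, 1]; the step 0.1 is an arbitrary choice for the check.
grid = [(i / 10.0, j / 10.0) for i in range(11) for j in range(11)]
convex_on_grid = is_midpoint_convex(grid)
increasing_in_a = all(phi(a + 0.1, b) >= phi(a, b) - 1e-12 for a, b in grid)
decreasing_in_b = all(phi(a, b + 0.1) <= phi(a, b) + 1e-12 for a, b in grid)
```

Convexity holds because, for $a > 0$, $\phi(a,b) = a\,g(b/a)$ with $g(s) = \max(1-s,0)^2$, that is, $\phi$ is the perspective of a convex function.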

Proof. The proof is a straightforward application of Theorem 6. We 
will proceed in several steps. 

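Step 1 (Estimating covering numbers). First of all, if given a class of
measurable functions on $\mathcal{X}$, $\mathcal{F} = \{f : \mathcal{X} \to [0,1]\}$, we introduce a new class
of measurable functions

$$\mathcal{F}_{\mathcal{Y}} = \{g(x,y) = yf(x) : \mathcal{X}\times\mathcal{Y} \to [-1,1] : f \in \mathcal{F}\}$$

defined on $\mathcal{X}\times\mathcal{Y}$, then

$$N(\mathcal{F}_{\mathcal{Y}},u) = N(\mathcal{F},u)$$

since for any $(x_1,y_1),\dots,(x_n,y_n)$ and any $f_1, f_2 \in \mathcal{F}$ we have

$$\frac{1}{n}\sum_{i=1}^{n}(y_if_1(x_i) - y_if_2(x_i))^2 = \frac{1}{n}\sum_{i=1}^{n}(f_1(x_i) - f_2(x_i))^2.$$

Therefore, condition (2.2) on $\mathcal{H}$ is equivalent to the corresponding condition
on $\mathcal{H}_{\mathcal{Y}}$.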

The following bound for the uniform entropy of $\mathcal{F}_{d,\mathcal{Y}}$ in terms of $N(\mathcal{H}_{\mathcal{Y}},u)$
is well known (see [21], Lemma 2):

In combination with (2.2) it implies that for some $K > 0$

$$\log N(\mathcal{F}_{d,\mathcal{Y}},u) \le KdV\log\frac{1}{u}.$$

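For a fixed $\varphi_\delta \in \Phi$ the uniform covering numbers of the class $\varphi_\delta\circ\mathcal{F}_{d,\mathcal{Y}} = \{\varphi_\delta(g) : g \in \mathcal{F}_{d,\mathcal{Y}}\}$ can be bounded as

$$N(\varphi_\delta\circ\mathcal{F}_{d,\mathcal{Y}},u) \le N(\mathcal{F}_{d,\mathcal{Y}},\delta u),$$

since for any probability measure $Q$ on $\mathcal{X}\times\mathcal{Y}$ the Lipschitz condition on
$\varphi_\delta$ implies that

$$(Q(\varphi_\delta(yf(x)) - \varphi_\delta(yg(x)))^2)^{1/2} \le \delta^{-1}(Q(f-g)^2)^{1/2},$$

and, therefore,

$$\log N(\varphi_\delta\circ\mathcal{F}_{d,\mathcal{Y}},u) \le KdV\log\frac{1}{\delta u}.$$

Step 2 (Nonadaptive bound). Theorem 6 applied to $\varphi_\delta\circ\mathcal{F}_{d,\mathcal{Y}}$ guarantees
that for any $t > 0$ with probability at least $1 - e^{-t}$ for all $f \in \mathcal{F}_d$

$$P\varphi_\delta(yf(x)) - P_n\varphi_\delta(yf(x)) \le K\Bigl(\Bigl(\frac{dV}{n}\Bigr)^{1/2}\int_0^{(P\varphi_\delta(yf(x)))^{1/2}}\log^{1/2}\frac{1}{\delta u}\,du + \Bigl(\frac{tP\varphi_\delta(yf(x))}{n}\Bigr)^{1/2}\Bigr).$$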

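To estimate the first term on the right-hand side one can easily check that

(3.5)  $$\int_0^{s}\Bigl(\log\frac{1}{u}\Bigr)^{1/2}du \le 2s\Bigl(\log\frac{1}{s}\Bigr)^{1/2} \quad\text{for } s \in [0,e^{-1}].$$

This inequality is well known and, moreover, the value 2 of the constant is
irrelevant here. Hence,

$$\int_0^{(P\varphi_\delta(yf(x)))^{1/2}}\log^{1/2}\frac{1}{\delta u}\,du = \delta^{-1}\int_0^{\delta(P\varphi_\delta(yf(x)))^{1/2}}\log^{1/2}\frac{1}{u}\,du \le 2(P\varphi_\delta(yf(x)))^{1/2}\max\Bigl(1,\log^{1/2}\frac{1}{\delta(P\varphi_\delta(yf(x)))^{1/2}}\Bigr).$$

Without loss of generality we can assume that $P\varphi_\delta(yf(x)) \ge n^{-1}$; otherwise,
the bound of the theorem becomes trivial. Therefore,

$$\max\Bigl(1,\log^{1/2}\frac{1}{\delta(P\varphi_\delta(yf(x)))^{1/2}}\Bigr) \le \log^{1/2}\frac{n}{\delta},$$

which finally yields

$$\int_0^{(P\varphi_\delta(yf(x)))^{1/2}}\log^{1/2}\frac{1}{\delta u}\,du \le 2(P\varphi_\delta(yf(x)))^{1/2}\log^{1/2}\frac{n}{\delta}.$$

We have proved that

$$P\varphi_\delta(yf(x)) - P_n\varphi_\delta(yf(x)) \le K\Bigl(\Bigl(\frac{dV}{n}\log\frac{n}{\delta}\Bigr)^{1/2} + \Bigl(\frac{t}{n}\Bigr)^{1/2}\Bigr)(P\varphi_\delta(yf(x)))^{1/2},$$

which implies that

$$\phi(P\varphi_\delta(yf(x)), P_n\varphi_\delta(yf(x))) \le K\Bigl(\frac{dV}{n}\log\frac{n}{\delta} + \frac{t}{n}\Bigr).$$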
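
The elementary inequality (3.5) used in this step can be sanity-checked numerically. The sketch below approximates the integral with a midpoint rule (the number of steps and the test points are arbitrary assumptions) and compares it with the right-hand side $2s\log^{1/2}(1/s)$.

```python
import math


def entropy_integral(s: float, steps: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of sqrt(log(1/u)) over (0, s], s <= 1."""
    width = s / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * width  # midpoints avoid the singular endpoint u = 0
        total += math.sqrt(math.log(1.0 / u))
    return total * width


def upper_bound(s: float) -> float:
    """Right-hand side of (3.5)."""
    return 2.0 * s * math.sqrt(math.log(1.0 / s))


test_points = (0.01, 0.1, 0.2, math.exp(-1.0))
holds = all(entropy_integral(s) <= upper_bound(s) for s in test_points)
```

The check is confined to $s \le e^{-1}$, which is the range where the derivative of the right-hand side dominates the integrand.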

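Step 3 (Union bound, adaptivity). The statement of the theorem now
follows by applying the union bound and increasing $K$. Indeed, let us introduce
the event

$$A_{d,\delta}(t') = \Bigl\{\forall f \in \mathcal{F}_d : \phi(P\varphi_\delta(yf(x)), P_n\varphi_\delta(yf(x))) \le K\Bigl(\frac{dV}{n}\log\frac{n}{\delta} + \frac{t'}{n}\Bigr)\Bigr\},$$

which holds with probability at least $1 - e^{-t'}$. For a fixed $t$ and for a fixed $d$ and $\delta$,
define $t'$ according to the equality $e^{-t'} = (\delta e^{-t})/(d^2K)$, where $K$ is chosen
so that the condition $\sum_{d\in\mathbb{Z}_+,\,\delta\in\Delta}\delta d^{-2}/K \le 1$ holds. With this choice of $t'$
the event $A_{d,\delta}(t')$ can be rewritten

$$A_{d,\delta} = \Bigl\{\forall f \in \mathcal{F}_d : \phi(P\varphi_\delta(yf(x)), P_n\varphi_\delta(yf(x))) \le K\Bigl(\frac{dV}{n}\log\frac{n}{\delta} + \frac{1}{n}\log\frac{Kd^2}{\delta} + \frac{t}{n}\Bigr)\Bigr\}$$

and its probability is greater than

$$\Pr(A_{d,\delta}) \ge 1 - \frac{\delta e^{-t}}{d^2K}.$$

It implies that the probability of the intersection

$$\Pr\Bigl(\bigcap_{d,\delta}A_{d,\delta}\Bigr) \ge 1 - \sum_{\delta,d}\frac{\delta e^{-t}}{d^2K} \ge 1 - e^{-t}.$$

This means that with probability at least $1 - e^{-t}$ all the events $A_{d,\delta}$ hold
simultaneously. But, obviously, the second term in the definition of $A_{d,\delta}$ can
be bounded by

$$\frac{1}{n}\log\frac{Kd^2}{\delta} \le K\frac{d}{n}\log\frac{n}{\delta}$$

and, thus, $A_{d,\delta}$ is a subset of the event

$$A'_{d,\delta} = \Bigl\{\forall f \in \mathcal{F}_d : \phi(P\varphi_\delta(yf(x)), P_n\varphi_\delta(yf(x))) \le K'\Bigl(\frac{dV}{n}\log\frac{n}{\delta} + \frac{t}{n}\Bigr)\Bigr\}$$

for some $K' > K$, which proves the statement of the theorem, since

$$\Pr\Bigl(\bigcap_{d,\delta}A'_{d,\delta}\Bigr) \ge \Pr\Bigl(\bigcap_{d,\delta}A_{d,\delta}\Bigr) \ge 1 - e^{-t}.$$

□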
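
The bookkeeping of the union bound in Step 3 can be illustrated with a short computation. Assuming the specific choice $\Delta = \{2^{-k} : k \ge 1\}$ mentioned earlier, the sum of $\delta d^{-2}$ over all $d \ge 1$ and $\delta \in \Delta$ equals $\pi^2/6$, so any $K \ge \pi^2/6$ satisfies the required condition. The truncation levels `max_d` and `max_k` below are arbitrary assumptions of the sketch.

```python
import math


def t_prime(t: float, d: int, delta: float, K: float) -> float:
    """Inflated confidence parameter defined by exp(-t') = delta * exp(-t) / (d^2 * K)."""
    return t + math.log(d * d * K / delta)


def total_failure_probability(t: float, K: float, max_d: int = 2000, max_k: int = 60) -> float:
    """Truncated sum of exp(-t'(d, delta)) over d >= 1 and delta = 2^{-k}, k >= 1."""
    return sum(math.exp(-t_prime(t, d, 2.0 ** -k, K))
               for d in range(1, max_d + 1)
               for k in range(1, max_k + 1))


K = math.pi ** 2 / 6.0  # makes the full double sum of delta / (d^2 K) equal to 1
t = 3.0
failure = total_failure_probability(t, K)
```

The per-event penalty is the extra term $n^{-1}\log(Kd^2/\delta)$ in the bound, which is then absorbed into the leading term by increasing the constant.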

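Proof of Theorem 1. For a fixed $d, \gamma$ consider a class $\mathcal{F}_{d,\gamma} = \{f \in \mathcal{F} : d(f;\gamma) \le d\}$.
One can estimate the uniform entropy of $\mathcal{F}_{d,\gamma}$ as (see [21])

$$\log N(\mathcal{F}_{d,\gamma},u) \le K\Bigl(d\log\frac{1}{u} + \Bigl(\frac{\gamma}{u}\Bigr)^{\alpha}\Bigr).$$

For a fixed $\varphi_\delta \in \Phi$ the uniform covering numbers of the class $\varphi_\delta\circ\mathcal{F}_{d,\gamma} = \{\varphi_\delta(yf(x)) : f \in \mathcal{F}_{d,\gamma}\}$ can be bounded as

$$N(\varphi_\delta\circ\mathcal{F}_{d,\gamma},u) \le N(\mathcal{F}_{d,\gamma},\delta u),$$

since, for any probability measure $Q$ on $\mathcal{X}\times\mathcal{Y}$, the Lipschitz condition on
$\varphi_\delta$ implies that

$$(Q(\varphi_\delta(yf(x)) - \varphi_\delta(yg(x)))^2)^{1/2} \le \delta^{-1}(Q(yf - yg)^2)^{1/2} = \delta^{-1}(Q(f-g)^2)^{1/2},$$

and, therefore,

$$\log N(\varphi_\delta\circ\mathcal{F}_{d,\gamma},u) \le K\Bigl(d\log\frac{1}{\delta u} + \Bigl(\frac{\gamma}{\delta u}\Bigr)^{\alpha}\Bigr).$$

Using this estimate on the covering numbers, Theorem 6 now implies (in
exactly the same way we used it in the proof of Theorem 7; only integration
here is easier) that for any $t > 0$ with probability at least $1 - e^{-t}$ for all
$f \in \mathcal{F}_{d,\gamma}$

$$\frac{P\varphi_\delta(yf(x)) - P_n\varphi_\delta(yf(x))}{(P\varphi_\delta(yf(x)))^{1/2}} \le K\Bigl(\Bigl(\frac{d}{n}\log\frac{n}{\delta}\Bigr)^{1/2} + \Bigl(\frac{\gamma}{\delta}\Bigr)^{\alpha/2}\frac{(P\varphi_\delta(yf(x)))^{-\alpha/4}}{n^{1/2}} + \Bigl(\frac{t}{n}\Bigr)^{1/2}\Bigr).$$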

It remains to show that, possibly increasing $K$, this inequality holds for
all $d$, $\delta$ and $\gamma$. To do this we will use the above inequality with $t$ replaced
by $t' = t + \log\frac{Kd^2}{\delta\gamma}$ and, hence, $e^{-t}$ replaced by $e^{-t'} = (e^{-t}\delta\gamma)/(Kd^2)$, where
$\delta, \gamma \in \{2^{-k} : k \ge 1\}$. Then the union bound should be applied in the whole
range of $d$, $\delta$ and $\gamma$. Without loss of generality we assume that for all $f \in \mathcal{F}$
and $\delta \in \Delta$ we have $P\varphi_\delta(yf(x)) \ge n^{-1}$, and $\gamma$ can be restricted to the set of
values satisfying



7r /2 (p^(y/(*))r^ > m 1 / 2 



or, equivalently,

Under these assumptions 

log-^— < Adlog — , 

which allows us to complete the proof by using the union bound and choosing 
the value of K large enough. □ 

Proof of Corollary 1. To see that Theorem 1 implies Corollary
1 one should first notice that if $P_n\varphi_\delta(yf(x)) = 0$, then the inequality of
Theorem 1 can be solved for $P\varphi_\delta(yf(x))$ to give

pw&f/W) < nn = k mf(^io g = + (if am+a) „-^ + L 
(we prove it below). Moreover, if $P_n\varphi_\delta(yf(x))$ is of the same order of magnitude
as the right-hand side of this bound, then we will show that $P\varphi_\delta(yf(x))$ will also be of the same
order of magnitude. Finally, if $P_n\varphi_\delta(yf(x))$ is larger than a constant
times that right-hand side, then $P\varphi_\delta(yf(x))$ is dominated by a constant times $P_n\varphi_\delta(yf(x))$.
After all this is proved, it remains to notice that, for a specific choice of
functions $\varphi_\delta$ such that $\varphi_\delta(s) = 1$ for $s \le 0$, $\varphi_\delta(s) = 0$ for $s \ge \delta$ and linear on
$[0,\delta]$, we have

$$P(yf(x) \le 0) \le P\varphi_\delta(yf(x)) \quad\text{and}\quad P_n\varphi_\delta(yf(x)) \le P_n(yf(x) \le \delta).$$

We will now explain how to solve the inequality of Theorem 1. We observe
that it is of the form

(3.6)  $$y \le x + ay^{1/2} + by^{\beta},$$

where $y = P\varphi_\delta$, $x = P_n\varphi_\delta$, $0 < \beta < 1$, $a, b > 0$. In our case also $\beta = 1/2 - \alpha/4$.
Define $y_1$ and $y_2$ as the solutions of the equations

$$y_1 = ay_1^{1/2}, \qquad y_2 = by_2^{\beta}$$

and notice that

$$y \ge ay^{1/2} \text{ for } y \ge y_1; \qquad y \ge by^{\beta} \text{ for } y \ge y_2.$$

Assume that $x \le y_1 + y_2$. Then (3.6) implies that $y \le K(y_1 + y_2)$ for some
absolute constant $K > 0$. Indeed, if we plug $K(y_1 + y_2)$ into the right-hand
side of (3.6) we get

$$x + a(K(y_1+y_2))^{1/2} + b(K(y_1+y_2))^{\beta}$$
$$\le (y_1+y_2) + K^{1/2}a(y_1+y_2)^{1/2} + K^{\beta}b(y_1+y_2)^{\beta}$$
$$\le (y_1+y_2) + K^{1/2}(y_1+y_2) + K^{\beta}(y_1+y_2)$$

(since $y_1 + y_2 \ge y_1$ and $y_1 + y_2 \ge y_2$)

$$\le (1 + K^{1/2} + K^{\beta})(y_1+y_2) < K(y_1+y_2),$$

if $K$ is large enough. This shows that (3.6) fails for $y \ge K(y_1 + y_2)$, and hence
the solution of (3.6) is smaller than $K(y_1 + y_2)$. Assuming that $x \ge y_1 + y_2$
and setting $C := y/x$, we get from (3.6)

$$Cx \le x + C^{1/2}ax^{1/2} + C^{\beta}bx^{\beta} \le x + C^{1/2}x + C^{\beta}x = (1 + C^{1/2} + C^{\beta})x,$$

which implies $C \le 1 + C^{1/2} + C^{\beta}$ and hence $y \le Kx$ for a large enough
constant $K$. Thus, always with large enough $K$ we have $y \le K(x + y_1 + y_2)$,
implying the result. □
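
The argument above can be illustrated numerically. The sketch below finds the largest $y$ satisfying (3.6) by bisection and compares it with $x + y_1 + y_2$, where $y_1 = a^2$ and $y_2 = b^{1/(1-\beta)}$. The constant 6 is an assumption of this sketch: it works whenever $\beta \le 1/2$ (as in the case $\beta = 1/2 - \alpha/4$), because then $1 + 2\sqrt{6} < 6$. The parameter values are arbitrary.

```python
import math


def largest_solution(x: float, a: float, b: float, beta: float,
                     K: float = 6.0, iterations: int = 200) -> float:
    """Largest y with y <= x + a*sqrt(y) + b*y**beta, found by bisection on [0, K*s]."""
    s = x + a * a + b ** (1.0 / (1.0 - beta))

    def slack(y: float) -> float:
        # Concave in y, nonnegative at y = 0 and negative at y = K*s (for beta <= 1/2).
        return x + a * math.sqrt(y) + b * y ** beta - y

    lo, hi = 0.0, K * s
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if slack(mid) >= 0.0:
            lo = mid
        else:
            hi = mid
    return lo


def scale(x: float, a: float, b: float, beta: float) -> float:
    """The quantity x + y_1 + y_2 that controls the solution of (3.6)."""
    return x + a * a + b ** (1.0 / (1.0 - beta))


# (x, a, b, alpha) with beta = 1/2 - alpha/4; all values are illustrative assumptions.
cases = [(0.0, 0.1, 0.05, 1.0), (0.2, 0.1, 0.05, 0.5), (0.01, 0.3, 0.2, 1.5)]
ratios = [largest_solution(x, a, b, 0.5 - al / 4.0) / scale(x, a, b, 0.5 - al / 4.0)
          for x, a, b, al in cases]
```

In every case the ratio stays between 1/3 and 6: the solution is at least as large as each of $x$, $y_1$, $y_2$ and at most a constant multiple of their sum.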

Proof of Theorem 2. Let us make a specific choice of functions $\varphi_\delta$.
For each $\delta \in \Delta$ we set $\varphi_\delta$ to be $\varphi_\delta(s) = 1$ for $s \le \delta$, $\varphi_\delta(s) = 0$ for $s \ge 2\delta$ and
linear on $[\delta,2\delta]$.

Let us fix $f = \sum_{k=1}^{T}\lambda_kh_k \in \mathcal{F}$, and for a fixed $0 \le d \le T$ represent $f$ as

$$f = \sum_{k=1}^{d}\lambda_kh_k + \gamma_d(f)\sum_{k=d+1}^{T}\lambda'_kh_k,$$

where $\gamma_d(f) = \sum_{k=d+1}^{T}\lambda_k$ and $\lambda'_k = \lambda_k/\gamma_d(f)$.

Given $N \ge 1$, we generate an i.i.d. sequence of functions $\xi_1,\dots,\xi_N$ according
to the distribution $P_\xi(\xi_i = h_k) = \lambda'_k$ for $k = d+1,\dots,T$ and independent
of $\{(X_k,Y_k)\}$. Clearly, $E_\xi\xi_i(x) = \sum_{k=d+1}^{T}\lambda'_kh_k(x)$. Consider a function

$$g(x) = \sum_{k=1}^{d}\lambda_kh_k(x) + \gamma_d(f)\frac{1}{N}\sum_{k=1}^{N}\xi_k(x),$$

which plays the role of a random approximation of $f$ in the following sense.
We can write

(3.7)  $$P(yf(x) \le 0) = E_\xi P(yf(x) \le 0,\ yg(x) \le \delta) + E_\xi P(yf(x) \le 0,\ yg(x) > \delta) \le E_\xi P\varphi_\delta(yg(x)) + EP_\xi(yg(x) > \delta,\ E_\xi yg(x) \le 0).$$

In the last term for a fixed $(x,y) \in \mathcal{X}\times\mathcal{Y}$ we have

$$P_\xi(yg(x) > \delta,\ E_\xi yg(x) \le 0) \le P_\xi(yg(x) - E_\xi yg(x) > \delta) \le \exp(-N\delta^2/2\gamma_d^2(f)),$$

where in the last step we used Hoeffding's inequality. Hence,

(3.8)  $$P(yf(x) \le 0) - e^{-N\delta^2/2\gamma_d^2(f)} \le E_\xi P\varphi_\delta(yg(x)).$$
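
The random approximation step can be sketched in code as follows. The base classifiers (decision stumps), the weights, the split point $d$ and the values of $n$ and $\delta$ are hypothetical choices for this illustration; `random_approximation` keeps the first $d$ terms of the convex combination and replaces the tail by an average of $N$ functions sampled from the normalized tail weights.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[float], float]


def stump(threshold: float) -> Classifier:
    """Hypothetical {-1, +1}-valued base classifier."""
    return lambda x: 1.0 if x >= threshold else -1.0


def random_approximation(hs: Sequence[Classifier], weights: Sequence[float],
                         d: int, N: int, rng: random.Random) -> Tuple[Classifier, float]:
    """Return the random function g and the tail mass gamma_d (requires d < len(hs))."""
    gamma_d = sum(weights[d:])
    tail_probs = [w / gamma_d for w in weights[d:]]
    draws: List[Classifier] = rng.choices(list(hs[d:]), weights=tail_probs, k=N)

    def g(x: float) -> float:
        head = sum(w * h(x) for w, h in zip(weights[:d], hs[:d]))
        return head + gamma_d * sum(xi(x) for xi in draws) / N

    return g, gamma_d


def hoeffding_tail(N: int, delta: float, gamma_d: float) -> float:
    """The bound exp(-N delta^2 / (2 gamma_d^2)) used for the deviation of yg from its mean."""
    return math.exp(-N * delta ** 2 / (2.0 * gamma_d ** 2))


def sample_size(gamma_d: float, delta: float, n: int) -> int:
    """Smallest integer N >= 2 (gamma_d^2 / delta^2) log n, making the tail at most 1/n."""
    return math.ceil(2.0 * gamma_d ** 2 / delta ** 2 * math.log(n))


hs = [stump(0.1 * k) for k in range(1, 10)]
weights = [0.4, 0.3] + [0.3 / 7.0] * 7   # two dominant classifiers and a light tail
n, delta, d = 1000, 0.1, 2
g, gamma_d = random_approximation(hs, weights, d, sample_size(0.3, delta, n), random.Random(0))
N = sample_size(gamma_d, delta, n)
```

Because only the tail of mass $\gamma_d(f)$ is randomized, the number of sampled functions needed for a given accuracy scales with $\gamma_d^2(f)$ rather than with 1.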
Similarly, one can write

(3.9)  $$E_\xi P_n\varphi_\delta(yg(x)) \le E_\xi P_n(yg(x) \le 2\delta) \le P_n(yf(x) \le 3\delta) + E_\xi P_n(yg(x) \le 2\delta,\ yf(x) > 3\delta) \le P_n(yf(x) \le 3\delta) + e^{-N\delta^2/2\gamma_d^2(f)}.$$

Clearly, for any random realization of the sequence $\xi_1,\dots,\xi_N$, the random
function $g$ belongs to the class $\mathcal{F}_{d+N}$. Convexity of the function $\phi(a,b)$ and
Theorem 7 imply that for any $t > 0$ with probability at least $1 - e^{-t}$ for all
$\delta \in \Delta$ and all $f \in \mathcal{F}$

$$\phi(E_\xi P\varphi_\delta(yg(x)), E_\xi P_n\varphi_\delta(yg(x))) \le E_\xi\phi(P\varphi_\delta(yg(x)), P_n\varphi_\delta(yg(x))) \le K\Bigl(\frac{V(d+N)}{n}\log\frac{n}{\delta} + \frac{t}{n}\Bigr).$$




The fact that $\phi(a,b)$ is decreasing in $b$ and increasing in $a$ combined with
(3.8) and (3.9) implies that

$$\phi\bigl(P(yf(x) \le 0) - e^{-N\delta^2/2\gamma_d^2(f)},\ P_n(yf(x) \le 3\delta) + e^{-N\delta^2/2\gamma_d^2(f)}\bigr) \le K\Bigl(\frac{V(d+N)}{n}\log\frac{n}{\delta} + \frac{t}{n}\Bigr).$$

Setting $N = 2(\gamma_d^2(f)/\delta^2)\log n$, we get

$$\phi\bigl(P(yf(x) \le 0) - 1/n,\ P_n(yf(x) \le 3\delta) + 1/n\bigr) \le K\Bigl(\frac{Ve_n(f,\delta,d)}{n}\log\frac{n}{\delta} + \frac{t}{n}\Bigr),$$

where $e_n(f,\delta,d) = d + 2(\gamma_d^2(f)/\delta^2)\log n$. Solving the last inequality for $P(yf(x) \le 0)$
and changing the variable $3\delta \to \delta$ gives the bound (that holds with probability
at least $1 - e^{-t}$)

(3.10)  $$P(yf(x) \le 0) \le \bigl(W^{1/2} + (P_n(yf(x) \le \delta) + W)^{1/2}\bigr)^2,$$

where

$$W = W(f,n,d,\delta,t) := K\Bigl(\frac{Ve_n(f,\delta,d)}{n}\log\frac{n}{\delta} + \frac{t}{n}\Bigr).$$
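
The step "solving the last inequality" can be made explicit. For $a \ge b \ge 0$ the inequality $(a-b)^2/a \le W$ is a quadratic inequality in $a$, whose largest solution is $b + W/2 + \sqrt{bW + W^2/4}$; this is dominated by the more convenient expression $(W^{1/2} + (b+W)^{1/2})^2$ of the form appearing in (3.10). The sketch below checks both facts on an arbitrary grid of values.

```python
import math


def phi(a: float, b: float) -> float:
    """phi(a, b) = (a - b)^2 / a when a >= b and a > 0, and 0 otherwise."""
    if a <= 0.0 or a < b:
        return 0.0
    return (a - b) ** 2 / a


def largest_a(b: float, W: float) -> float:
    """Largest a with phi(a, b) <= W: the bigger root of a^2 - (2b + W) a + b^2 = 0."""
    return b + W / 2.0 + math.sqrt(b * W + W * W / 4.0)


def convenient_bound(b: float, W: float) -> float:
    """The expression (sqrt(W) + sqrt(b + W))^2."""
    return (math.sqrt(W) + math.sqrt(b + W)) ** 2


grid = [(b / 10.0, w / 100.0) for b in range(0, 11) for w in range(1, 21)]
exact = all(abs(phi(largest_a(b, W), b) - W) < 1e-9 for b, W in grid)
dominated = all(largest_a(b, W) <= convenient_bound(b, W) for b, W in grid)
```

The gap between the exact root and the convenient bound is at most a constant factor, which is harmless since the constant $K$ inside $W$ is not specified.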

It remains to make the bound uniform over $d$ and $\delta$, which is done using
standard union bound techniques. More specifically, replace $t$ in the above
bound by $t'(d,\delta) = t + 2\log(1/\delta) + 2\log d + c$, where $\delta \in \{2^{-k} : k \ge 1\}$ and

$$c := 2\log\Bigl(\sum_{k=1}^{\infty}k^{-2}\Bigr).$$

Then the union bound can be used to show that (3.10) [with $t$ replaced by
$t'(d,\delta)$] holds for all $d$ and all $\delta \in \{2^{-k} : k \ge 1\}$ simultaneously with probability
at least $1 - p$, where

$$p \le e^{-t-c}\sum_{k=1,d=1}^{\infty}e^{-2\log k - 2\log d} = e^{-t-c}\Bigl(\sum_{k=1}^{\infty}k^{-2}\Bigr)^2 = e^{-t},$$

and, hence, we also have with probability at least $1 - e^{-t}$

$$P(yf(x) \le 0) \le \inf_{\delta\in\{2^{-k}:k\ge 1\}}\inf_{d}\bigl(W^{1/2}(f,n,d,\delta,t'(d,\delta)) + (P_n(yf(x) \le \delta) + W(f,n,d,\delta,t'(d,\delta)))^{1/2}\bigr)^2.$$

Taking into account the monotonicity of the function $e_n(f,\delta,d)$ with respect
to $\delta$ (and increasing the value of the constant $K$), it is now easy to extend
the infimum over $\delta$ to all $\delta \in (0,1]$. Increasing the value of $K$ further allows
one to rewrite the bound as

$$P(yf(x) \le 0) \le \inf_{\delta\in(0,1]}\bigl(U^{1/2} + (P_n(yf(x) \le \delta) + U)^{1/2}\bigr)^2$$

with $U$ defined in the formulation of the theorem, which completes the proof.
□

Theorem 3 is a special case of Theorem 4; thus we will proceed by proving 
Theorem 4. 

Proof of Theorem 4. We will proceed to prove Theorem 4 in several 
steps. 

Step 1 (Random approximation). Consider functions $\varphi_\delta$ the same as
in the proof of Theorem 2. Let $\lambda \in \mathcal{P}(\mathcal{H})$ and $f(x) = \int h(x)\,\lambda(dh)$. Consider
an element $c \in \mathcal{C}^m(\lambda)$, that is, $c = (\alpha_1,\dots,\alpha_m,\lambda^1,\dots,\lambda^m)$, such that
$\lambda = \sum_{j=1}^{m}\alpha_j\lambda^j$ and $\lambda^j \in \mathcal{P}(\mathcal{H})$. We interpreted $c$ as a decomposition of $\lambda$
into $m$ clusters, or in other words, the decomposition of the set $\{h_i\}$ into
$m$ clusters. This time we will generate functions from each cluster independently
from each other (and, as before, independently of the data) and take
their weighted sum to approximate $f(x)$. Given $N \ge 1$, let us generate independent
random functions $\xi_k^j(x)$, $k \le N$, $j \le m$, where for each $j \le m$, the
$\xi_k^j$'s have the distribution

$$P_\xi(\xi_k^j = h_i) = \lambda^j(\{h_i\}), \quad i \le T.$$

Consider a function

$$g(x) = \frac{1}{N}\sum_{j=1}^{m}\alpha_j\sum_{k=1}^{N}\xi_k^j(x) = \frac{1}{N}\sum_{k=1}^{N}g_k(x),$$

where $g_k(x) = \sum_{j=1}^{m}\alpha_j\xi_k^j(x)$. For a fixed $x \in \mathcal{X}$ and $k \le N$, the variance of
$g_k$ with respect to the distribution $P_\xi = P_{\xi^1}\times\cdots\times P_{\xi^m}$ is

$$\mathrm{Var}_\xi(g_k(x)) = \sum_{j=1}^{m}\alpha_j^2\,\mathrm{Var}_\xi(\xi_1^j(x)) = \sum_{j=1}^{m}\alpha_j^2\,\sigma_{\lambda^j}^2(x) = \sigma^2(c;x).$$

The main difference from the proof of Theorem 2 is that in (3.7) we also
introduce the condition on the variance $\sigma^2(c;x)$. Namely, one can write

(3.11)  $$P(yf(x) \le 0) \le E_\xi P\varphi_\delta(yg(x)) + P(\sigma^2(c;x) \ge \gamma) + EP_\xi(yg(x) \ge \delta,\ yf(x) \le 0,\ \sigma^2(c;x) \le \gamma).$$

Similarly to (3.9) one can also write

(3.12)  $$E_\xi P_n\varphi_\delta(yg(x)) \le E_\xi P_n(yg(x) \le 2\delta) \le P_n(yf(x) \le 3\delta) + P_n(\sigma^2(c;x) \ge \gamma) + P_nP_\xi(yg(x) \le 2\delta,\ yf(x) \ge 3\delta,\ \sigma^2(c;x) \le \gamma).$$

Step 2 (Bernstein's inequality). To bound the last terms on the right-hand
sides of (3.11) and (3.12) we note that we explicitly introduced the
condition on the variance of the $g_k$'s, since for a fixed $x \in \mathcal{X}$ we have
$\mathrm{Var}_\xi(g_k(x)) = \sigma^2(c;x)$. Therefore, instead of using Hoeffding's inequality as
we did in the proof of Theorem 2, it is advantageous to use Bernstein's inequality,
since it takes into account the information about the variance. We
have

$$P_\xi(yg(x) \ge \delta,\ yf(x) \le 0,\ \sigma^2(c;x) \le \gamma) \le P_\xi\Bigl(\sum_{k=1}^{N}(yg_k(x) - yE_\xi g_k(x)) \ge N\delta \ \Big|\ \mathrm{Var}_\xi(g_1(x)) \le \gamma\Bigr)$$
$$\le \exp\Bigl(-\frac{1}{4}\min\Bigl(\frac{N\delta^2}{\gamma}, N\delta\Bigr)\Bigr) = \exp\Bigl(-\frac{1}{4}\frac{N\delta^2}{\gamma}\Bigr),$$

since we assume that $\gamma \ge \delta$. Taking $N = 4(\gamma/\delta^2)\log n$ we get

(3.13)  $$P(yf(x) \le 0) \le E_\xi P\varphi_\delta(yg(x)) + P(\sigma^2(c;x) \ge \gamma) + n^{-1}.$$

Similarly, applying Bernstein's inequality to the last term of (3.12) yields

(3.14)  $$E_\xi P_n\varphi_\delta(yg(x)) \le P_n(yf(x) \le 3\delta) + P_n(\sigma^2(c;x) \ge \gamma) + n^{-1}.$$
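
The gain from Bernstein's inequality can be quantified with a small computation. The sketch below uses the classical form of Bernstein's inequality for centered summands bounded by $M$ in absolute value; the value $M = 2$ is an assumption of this sketch (it holds for differences of $[-1,1]$-valued quantities), as are the numerical values of $n$, $\delta$ and $\gamma$. It checks that the classical exponent dominates the simplified exponent $\frac14\min(N\delta^2/\gamma, N\delta)$ and compares the resulting sample size with the Hoeffding-based choice from the proof of Theorem 2 (taken here with $\gamma_d(f) = 1$).

```python
import math


def bernstein_exponent(N: float, delta: float, gamma: float, M: float = 2.0) -> float:
    """Exponent in P(sum of N centered terms >= N*delta) <= exp(-exponent).

    Classical Bernstein inequality with variance at most gamma per term and
    |term| <= M; M = 2 is an assumption made for this illustration.
    """
    return N * delta ** 2 / (2.0 * gamma + 2.0 * M * delta / 3.0)


def simplified_exponent(N: float, delta: float, gamma: float) -> float:
    """The simplified exponent (1/4) * min(N delta^2 / gamma, N delta)."""
    return 0.25 * min(N * delta ** 2 / gamma, N * delta)


def bernstein_sample_size(gamma: float, delta: float, n: int) -> float:
    """N = 4 (gamma / delta^2) log n."""
    return 4.0 * gamma / delta ** 2 * math.log(n)


def hoeffding_sample_size(gamma_d: float, delta: float, n: int) -> float:
    """N = 2 (gamma_d^2 / delta^2) log n, as in the proof of Theorem 2."""
    return 2.0 * gamma_d ** 2 / delta ** 2 * math.log(n)


n, delta, gamma = 1000, 0.1, 0.1
N_bernstein = bernstein_sample_size(gamma, delta, n)
N_hoeffding = hoeffding_sample_size(1.0, delta, n)
tail = math.exp(-simplified_exponent(N_bernstein, delta, gamma))
grid = [(d / 20.0, g / 20.0) for d in range(1, 21) for g in range(1, 21)]
dominates = all(bernstein_exponent(100.0, d, g) >= simplified_exponent(100.0, d, g)
                for d, g in grid)
```

With a small variance level $\gamma$ the required number of sampled functions, and hence the dimension $mN$ of the approximating class, drops by a factor of order $\gamma$.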

Step 3 [Relating $E_\xi P\varphi_\delta(yg(x))$ to $E_\xi P_n\varphi_\delta(yg(x))$]. Our next goal is to
relate $E_\xi P\varphi_\delta(yg(x))$ from the right-hand side of (3.13) to $E_\xi P_n\varphi_\delta(yg(x))$
from the left-hand side of (3.14).

For any realization of random variables $\xi_k^j$, the function $g(x)$ will belong
to the class $\mathcal{F}_{mN}$. Convexity of the function $\phi(a,b)$ and Theorem 7 imply
that for any $t > 0$ with probability at least $1 - e^{-t}$ for all $\delta \in \Delta$, $\lambda \in \mathcal{P}(\mathcal{H})$
and $f(x) = \int h(x)\,d\lambda$, and any $c \in \mathcal{C}^m(\lambda)$,

$$\phi(E_\xi P\varphi_\delta(yg(x)), E_\xi P_n\varphi_\delta(yg(x))) \le E_\xi\phi(P\varphi_\delta(yg(x)), P_n\varphi_\delta(yg(x))) \le K\Bigl(\frac{VmN}{n}\log\frac{n}{\delta} + \frac{t}{n}\Bigr).$$

The fact that $\phi(a,b)$ is decreasing in $b$ and increasing in $a$ combined with
(3.13) and (3.14) [recall that $N = 4(\gamma/\delta^2)\log n$] implies that

$$\phi\bigl(P(yf(x) \le 0) - P(\sigma^2(c;x) \ge \gamma) - n^{-1},\ P_n(yf(x) \le 3\delta) + P_n(\sigma^2(c;x) \ge \gamma) + n^{-1}\bigr) \le K\Bigl(\frac{Vm\gamma}{n\delta^2}\log^2\frac{n}{\delta} + \frac{t}{n}\Bigr).$$

Solving the last inequality for $P(yf(x) \le 0)$ one can get that with probability
at least $1 - e^{-t}$ for all $\delta \in \Delta$, any $\gamma \ge \delta$, for any $\lambda \in \mathcal{P}(\mathcal{H})$ and $f(x) = \int h(x)\,d\lambda$, and any $c \in \mathcal{C}^m(\lambda)$,

(3.15)  $$P(yf(x) \le 0) \le K\Bigl(P_n(yf(x) \le 3\delta) + P_n(\sigma^2(c;x) \ge \gamma) + P(\sigma^2(c;x) \ge \gamma) + \frac{Vm\gamma}{n\delta^2}\log^2\frac{n}{\delta} + \frac{t}{n}\Bigr).$$

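Step 4 [Bounding $P(\sigma^2(c;x) \ge \gamma)$]. It remains to estimate $P(\sigma^2(c;x) \ge \gamma)$.
This is done very similarly to steps 1-3 above. Let us generate two
independent sequences $\xi_k^{j,1}$ and $\xi_k^{j,2}$ as above and consider

$$\sigma_N^2(c;x) = \frac{1}{2N}\sum_{k=1}^{N}\Bigl(\sum_{j=1}^{m}\alpha_j(\xi_k^{j,1}(x) - \xi_k^{j,2}(x))\Bigr)^2 = \frac{1}{N}\sum_{k=1}^{N}\zeta_k(x),$$

where

(3.16)  $$\zeta_k(x) = \frac{1}{2}\Bigl(\sum_{j=1}^{m}\alpha_j(\xi_k^{j,1}(x) - \xi_k^{j,2}(x))\Bigr)^2.$$

Let us make a specific choice of functions $\varphi_\gamma$. For each $\gamma \in \Delta$ we set $\varphi_\gamma$ to
be $\varphi_\gamma(s) = 0$ for $s \le 2\gamma$, $\varphi_\gamma(s) = 1$ for $s \ge 3\gamma$ and linear on $[2\gamma,3\gamma]$. One can
write

(3.17)  $$P(\sigma^2(c;x) \ge 4\gamma) = E_\xi P(\sigma^2(c;x) \ge 4\gamma,\ \sigma_N^2(c;x) \ge 3\gamma) + E_\xi P(\sigma^2(c;x) \ge 4\gamma,\ \sigma_N^2(c;x) \le 3\gamma)$$
$$\le E_\xi P\varphi_\gamma(\sigma_N^2(c;x)) + EP_\xi(\sigma_N^2(c;x) \le 3\gamma,\ \sigma^2(c;x) \ge 4\gamma).$$

Similarly, one can write

(3.18)  $$E_\xi P_n\varphi_\gamma(\sigma_N^2(c;x)) \le E_\xi P_n(\sigma_N^2(c;x) \ge 2\gamma) \le P_n(\sigma^2(c;x) \ge \gamma) + P_nP_\xi(\sigma_N^2(c;x) \ge 2\gamma,\ \sigma^2(c;x) \le \gamma).$$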

Next we will show that there exists a large enough absolute constant $K > 0$
such that

(3.19)  $$P_\xi(\sigma_N^2(c;x) \ge 2\gamma,\ \sigma^2(c;x) \le \gamma) \le \exp\Bigl(-\frac{N\gamma}{K}\Bigr)$$

and

(3.20)  $$P_\xi(\sigma_N^2(c;x) \le 3\gamma,\ \sigma^2(c;x) \ge 4\gamma) \le \exp\Bigl(-\frac{N\gamma}{K}\Bigr).$$

First of all, let us notice that $\sigma_N^2(c;x) = N^{-1}\sum_{k=1}^{N}\zeta_k(x)$, where $\zeta_k$ are i.i.d.
random variables defined in (3.16) and $E_\xi\zeta_k(x) = \sigma^2(c;x)$. Moreover, since
$\xi_k^{j,1}, \xi_k^{j,2} \in \mathcal{H}$, we have $|\xi_k^{j,1}(x) - \xi_k^{j,2}(x)| \le 2$ and $|\zeta_k(x)| \le 2$. Finally, the
variance

$$\mathrm{Var}_\xi(\zeta_1) \le E_\xi\zeta_1^2 \le 2E_\xi\zeta_1 = 2\sigma^2(c;x).$$

Hence, Bernstein's inequality implies that

$$P_\xi\Bigl(\sigma_N^2(c;x) - \sigma^2(c;x) \le 2\sqrt{\sigma^2(c;x)\gamma/K} + 8\gamma/(3K)\Bigr) \ge 1 - \exp\Bigl(-\frac{N\gamma}{K}\Bigr)$$

and

$$P_\xi\Bigl(\sigma^2(c;x) - \sigma_N^2(c;x) \le 2\sqrt{\sigma^2(c;x)\gamma/K} + 8\gamma/(3K)\Bigr) \ge 1 - \exp\Bigl(-\frac{N\gamma}{K}\Bigr).$$

It is now easy to check that for large enough $K > 0$, given $\sigma^2(c;x) \le \gamma$,
the first inequality will imply $\sigma_N^2(c;x) \le 2\gamma$ [with probability at least $1 - \exp(-N\gamma/K)$],
thus proving (3.19) and, given $\sigma_N^2(c;x) \le 3\gamma$, the second inequality
will similarly imply $\sigma^2(c;x) \le 4\gamma$, thus proving (3.20).
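
The key facts about the variables defined in (3.16), namely that their mean equals the weighted variance over clusters and that they are bounded by 2, can be verified exactly on a toy example by enumerating all outcomes. The two-cluster configuration below (values of the classifiers at a fixed $x$ and their probabilities within each cluster) is hypothetical.

```python
from fractions import Fraction
from itertools import product


def variance(values, probs):
    """Exact variance of a discrete distribution."""
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))


def zeta_distribution(alphas, cluster_values, cluster_probs):
    """All (value, probability) pairs of zeta = (1/2) (sum_j alpha_j (u_j - v_j))^2,

    where u_j and v_j are independent draws of h(x) from cluster j.
    """
    per_cluster = []
    for values, probs in zip(cluster_values, cluster_probs):
        per_cluster.append([(u - v, pu * pv)
                            for u, pu in zip(values, probs)
                            for v, pv in zip(values, probs)])
    outcomes = []
    for combo in product(*per_cluster):
        prob, total = Fraction(1), Fraction(0)
        for alpha, (diff, p) in zip(alphas, combo):
            prob *= p
            total += alpha * diff
        outcomes.append((total * total / 2, prob))
    return outcomes


alphas = [Fraction(1, 2), Fraction(1, 2)]
cluster_values = [[1, -1], [1, 1]]   # values h(x) within each cluster (assumed)
cluster_probs = [[Fraction(3, 4), Fraction(1, 4)], [Fraction(1, 2), Fraction(1, 2)]]

outcomes = zeta_distribution(alphas, cluster_values, cluster_probs)
expected_zeta = sum(value * prob for value, prob in outcomes)
sigma2_c = sum(a ** 2 * variance(v, p)
               for a, v, p in zip(alphas, cluster_values, cluster_probs))
```

Using two independent copies removes the unknown means, so the empirical average of these variables estimates the weighted variance without bias.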

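If in (3.19) and (3.20) we set $N = K\gamma^{-1}\log n$, then with this choice of $N$
one can rewrite (3.17) and (3.18) as

(3.21)  $$P(\sigma^2(c;x) \ge 4\gamma) \le E_\xi P\varphi_\gamma(\sigma_N^2(c;x)) + n^{-1}$$

and

(3.22)  $$E_\xi P_n\varphi_\gamma(\sigma_N^2(c;x)) \le P_n(\sigma^2(c;x) \ge \gamma) + n^{-1}.$$

For any realization of $\xi_k^{j,1}, \xi_k^{j,2}$, the function $\sigma_N^2(c;x)$ belongs to the class

$$\mathcal{F}_{N,m} = \Bigl\{\frac{1}{2N}\sum_{k=1}^{N}\Bigl(\sum_{j=1}^{m}\alpha_j(h_k^{j,1} - h_k^{j,2})\Bigr)^2 : h_k^{j,1}, h_k^{j,2} \in \mathcal{H},\ \alpha_j \ge 0,\ \sum_{j=1}^{m}\alpha_j = 1\Bigr\}.$$

Since the class $\mathcal{H}$ satisfies condition (2.2), it is easy to show (see, e.g., [21]
for a similar computation) that the uniform covering numbers of $\mathcal{F}_{N,m}$ can
be bounded by

$$\log N(\mathcal{F}_{N,m},u) \le KVNm\log\frac{2}{u}, \quad 0 < u \le 1.$$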

The rest of the argument is similar to the above. Convexity of the function
$\phi(a,b)$ and Theorem 7 imply that for any $t > 0$ with probability at least
$1 - e^{-t}$ for all $\gamma \in \Delta$, $\lambda \in \mathcal{P}(\mathcal{H})$ and any $c \in \mathcal{C}^m(\lambda)$,

$$\phi(E_\xi P\varphi_\gamma(\sigma_N^2(c;x)), E_\xi P_n\varphi_\gamma(\sigma_N^2(c;x))) \le E_\xi\phi(P\varphi_\gamma(\sigma_N^2(c;x)), P_n\varphi_\gamma(\sigma_N^2(c;x))) \le K\Bigl(\frac{VmN}{n}\log\frac{n}{\gamma} + \frac{t}{n}\Bigr).$$

The fact that $\phi(a,b)$ is decreasing in $b$ and increasing in $a$ combined with
(3.21) and (3.22) (recall that $N = K\log n/\gamma$) implies that

$$\phi\bigl(P(\sigma^2(c;x) \ge 4\gamma) - n^{-1},\ P_n(\sigma^2(c;x) \ge \gamma) + n^{-1}\bigr) \le K\Bigl(\frac{Vm}{n\gamma}\log^2\frac{n}{\gamma} + \frac{t}{n}\Bigr).$$

Solving the last inequality for $P(\sigma^2(c;x) \ge 4\gamma)$ we get that with probability
at least $1 - e^{-t}$ for any $\gamma \in \Delta$, for any $\lambda \in \mathcal{P}(\mathcal{H})$ and any $c \in \mathcal{C}^m(\lambda)$,

$$P(\sigma^2(c;x) \ge 4\gamma) \le K\Bigl(P_n(\sigma^2(c;x) \ge \gamma) + \frac{Vm}{n\gamma}\log^2\frac{n}{\gamma} + \frac{t}{n}\Bigr).$$

Finally, we combine this with (3.15) and notice that since we assume that $\gamma \ge \delta$,

$$\frac{Vm}{n\gamma}\log^2\frac{n}{\gamma} \le \frac{Vm\gamma}{n\delta^2}\log^2\frac{n}{\delta}.$$

Thus, with probability at least $1 - e^{-t}$ for any $\delta \in \Delta$, any $\delta \le \gamma \in \Delta$, for any
$\lambda \in \mathcal{P}(\mathcal{H})$ and any $c \in \mathcal{C}^m(\lambda)$,

(3.23)  $$P(yf(x) \le 0) \le K\Bigl(P_n(yf(x) \le 3\delta) + P_n(\sigma^2(c;x) \ge \gamma/4) + \frac{Vm\gamma}{n\delta^2}\log^2\frac{n}{\delta} + \frac{t}{n}\Bigr).$$

Using the union bound one can show that with a larger constant this inequality
holds for all $m \ge 1$, also with probability at least $1 - e^{-t}$. Finally, to
obtain the statement of Theorem 4, we need to make the change of variables
$3\delta \to \delta$, $\gamma/4 \to \gamma$, and, in order to preserve the condition $\gamma \ge \delta$, we notice
that from the very beginning we could have assumed that $\gamma \ge 12\delta$ and then
deal with the case of $\gamma \in [\delta,12\delta]$ by increasing the value of $K$. □



We turn now to the proof of Theorem 5. It will be based on several facts.
First of all, we need a slight modification of Theorem 2 in [8].
Let $\mathcal{F}$ be a class of functions $f$ from $\mathcal{X}$ into $[0,1]$. We define the Rademacher
process $R_n(f)$, $f \in \mathcal{F}$, as

$$R_n(f) := n^{-1}\sum_{i=1}^{n}\varepsilon_if(X_i),$$

where $\{\varepsilon_i\}$ is a Rademacher sequence [$\Pr(\varepsilon_i = 1) = \Pr(\varepsilon_i = -1) = 1/2$] independent
of $\{X_i\}$. Denote also

$$R_n(\mathcal{F}) := \sup_{f\in\mathcal{F}}|R_n(f)|.$$
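
For a finite class and a small sample the expectation of $R_n(\mathcal{F})$ over the Rademacher signs can be computed exactly by enumerating all $2^n$ sign vectors. The sketch below does this for a hypothetical class of two $[0,1]$-valued functions, each represented by its vector of values on the sample.

```python
from itertools import product
from typing import Sequence


def rademacher_process(signs: Sequence[int], values: Sequence[float]) -> float:
    """R_n(f) = n^{-1} * sum_i eps_i f(X_i) for one sign vector."""
    return sum(e * v for e, v in zip(signs, values)) / len(values)


def expected_sup(function_values: Sequence[Sequence[float]]) -> float:
    """E_eps sup_f |R_n(f)|, averaging over all 2^n Rademacher sign vectors.

    Each entry of function_values is the vector (f(X_1), ..., f(X_n)).
    Exhaustive enumeration is only feasible for small n.
    """
    n = len(function_values[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += max(abs(rademacher_process(signs, vals)) for vals in function_values)
    return total / 2 ** n


# Two functions on a sample of size 2 (assumed values).
function_values = [(1.0, 1.0), (1.0, 0.0)]
average = expected_sup(function_values)
```

For larger samples the enumeration would be replaced by Monte Carlo sampling of the signs, at the cost of an additional approximation error.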
Theorem 8. Suppose that for all $t > 0$ with probability at least $1 - e^{-t}$

$$E_\varepsilon\sup_{f\in\mathcal{F},\,P_nf\le r}|R_n(f)| \le C(\phi_n(\sqrt{r}) + \delta_n(t)) \quad\text{where } r > 0,$$

(fin is a nondecreasing concave possibly data- dependent function with (f) n (0) = 
0, S n (t) >y t and C > is a constant. Let f n be the largest solution of the 
equation <fi n (^/r) = r. Then, there exists K > such that with probability at 
least 1 — e _< for all f G T 

' t + log log n s 



<K(¥ n f + r n + S ri 



n 



Next we need the following bound on the expected sup-norm of the
Rademacher process. Let

$$D_{P_n,2}(\mathcal{F}) := \sup_{f,g\in\mathcal{F}}d_{P_n,2}(f,g)$$

denote the $L_2(P_n)$-diameter of $\mathcal{F}$.



Lemma 1. Let $\mathcal{F}$ be a class of measurable functions from $\mathcal{X}$ into $[0,1]$
such that $0 \in \mathcal{F}$. Then there exists a constant $K > 0$ such that for all $n \ge 1$
and $t > 0$



Jn 



n OPn.l 



n 



t 

+ -• 

n 



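Proof. For given $t > 0$ and $n \ge 1$, there exists a map $\pi = \pi_{n,t} : \mathcal{F} \mapsto \mathcal{F}$
such that

$$\mathrm{card}(\pi\mathcal{F}) = N_{d_{P_n,2}}\Bigl(\mathcal{F};\sqrt{t/n}\Bigr) \quad\text{and}\quad d_{P_n,2}(f,\pi f) \le \sqrt{t/n}.$$

This implies that

$$E_\varepsilon R_n(\mathcal{F}) \le E_\varepsilon\sup_{f\in\mathcal{F}}|R_n(\pi f)| + E_\varepsilon\sup_{f\in\mathcal{F}}|R_n(f - \pi f)|.$$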

By a standard entropy bound, we have 



E £ SU P |i? n (7T/)| < 



K f D Vn,2(F) 1/2 



'i/n 



H^(f,u)du 



with some constant if > 0. Let now T' be a (i/n)-net for T with respect to 
the metric dp„ 1. Note that, since the functions from T take their values in 
[0,1], 



p„,i(/,/')<- 
n 



4 n , 2 (/,/0<- 



dp n ,2(/', vr/) < d Pn , 2 (/, /') + d Wn>2 (f, tt/) < 2 
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Therefore, we get 

E e sup|i?n(/-7r/)| 



Since 



< E e supj \R n (f' -9)\:f'e^,ge ttF, d Fn , 2 (f, nf) < 2^|. 

card(F' x Trf) = N dfnA (V, N d ^ 2 U^j, 

we get, using standard bounds for the expectation of a finite maximum of a 
Rademacher process, 



e« sup K(/ - rf) \ < kJI± (h^ (V, i) + H d ^ 2 (r, ,/T 

with some > 0, which in view of the trivial bound 

implies the statement of the lemma. □ 
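The proof above works with $\varepsilon$-nets in the empirical metrics $d_{P_n,1}$ and $d_{P_n,2}$ and uses the fact that, for $[0,1]$-valued functions, $d^2_{P_n,2} \le d_{P_n,1}$. A minimal sketch of both ingredients for a finite class given by its values on the sample follows; the greedy construction is a standard device chosen here for illustration (it yields an upper bound on the covering number, not necessarily a minimal net), and all names are hypothetical.

```python
def emp_dist(f, g, p=2):
    """Empirical L_p(P_n) distance between two functions given by sample values."""
    n = len(f)
    return (sum(abs(a - b) ** p for a, b in zip(f, g)) / n) ** (1.0 / p)


def greedy_net(values, eps, p=2):
    """Greedy eps-net: every function ends up within eps of some center.

    The number of centers is an upper bound on the eps-covering number
    of the class in the empirical L_p metric.
    """
    centers = []
    for f in values:
        if all(emp_dist(f, c, p) > eps for c in centers):
            centers.append(f)
    return centers


# Hypothetical [0, 1]-valued class on a sample of size 2.
cls = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0]]
net = greedy_net(cls, eps=0.2)

# For [0, 1]-valued functions, |f - g|^2 <= |f - g|, hence d_2^2 <= d_1.
f, g = [0.0, 0.5, 1.0], [1.0, 0.25, 0.5]
d1, d2 = emp_dist(f, g, p=1), emp_dist(f, g, p=2)
```

The second part is the comparison used to pass from a $(t/n)$-net in $d_{P_n,1}$ to a bound of order $\sqrt{t/n}$ in $d_{P_n,2}$.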

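Let $Q \in \mathcal P(\mathcal X)$. For a set $E$ of positive numbers and a
function $N$ defined on $E$, let

\[
\mathcal F_{Q,p,N} := \{ f \in \mathcal F : \forall \varepsilon \in E\ \
  N_{d_{Q,p}}(f, C\varepsilon) \le N(\varepsilon) \}.
\]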
Lemma 2. For all $\varepsilon \in E$

\[
H_{d_{Q,p}}\bigl(\mathcal F_{Q,p,N}, (2 + C)\varepsilon\bigr)
  \le K N(\varepsilon) \log\frac{1}{\varepsilon}
\]

with some constant $K > 0$.

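Proof. First note that $f \in \mathcal F_{Q,p,N}$ implies that

\[
\forall \varepsilon \in E\ \ \exists \mathcal H' \subset \mathcal H :\
  f \in \mathrm{sconv}(\mathcal H')
\quad\text{and}\quad
N_{d_{Q,p}}(\mathcal H', C\varepsilon) \le N(\varepsilon).
\]

Let $f = \sum_j \lambda_j h_j$, $h_j \in \mathcal H'$ and
$\sum_j |\lambda_j| \le 1$. Then there exists $\mathcal H'' \subset \mathcal H$
such that $\mathrm{card}(\mathcal H'') \le N(\varepsilon)$ and for all
$h \in \mathcal H'$ there exists $g \in \mathcal H''$ such that
$d_{Q,p}(h,g) \le C\varepsilon$. Hence, one can define
$\{h'_j\} \subset \mathcal H''$ such that
$\max_j d_{Q,p}(h_j, h'_j) \le C\varepsilon$. Let now $\mathcal H_\varepsilon$
denote a minimal $\varepsilon$-net for $\mathcal H$ with respect to $d_{Q,p}$.
Define $\tilde h_j \in \mathcal H_\varepsilon$ in such a way that for all $j$,
$d_{Q,p}(\tilde h_j, h'_j) \le \varepsilon$, which, of course, implies

\[
\max_j d_{Q,p}(h_j, \tilde h_j) \le (C + 1)\varepsilon.
\]

Clearly, we can also assume that

\[
\mathrm{card}\{\tilde h_j\} \le \mathrm{card}\{h'_j\}
  \le \mathrm{card}(\mathcal H'') \le N(\varepsilon).
\]

We can conclude that

\[
d_{Q,p}\Bigl( \sum_j \lambda_j h_j, \sum_j \lambda_j \tilde h_j \Bigr)
  \le \sum_j |\lambda_j|\, d_{Q,p}(h_j, \tilde h_j)
  \le \sum_j |\lambda_j| \max_j d_{Q,p}(h_j, \tilde h_j)
  \le (C + 1)\varepsilon.
\]

The above argument shows that for all $\varepsilon \in E$

\[
\mathcal F_{Q,p,N} \subset
  \bigl[\mathrm{sconv}_{N(\varepsilon)}(\mathcal H_\varepsilon)\bigr]_{(C+1)\varepsilon},
\]

where $[\cdot]_\varepsilon$ denotes the $\varepsilon$-neighborhood w.r.t. the
metric $d_{Q,p}$ and

\[
\mathrm{sconv}_d(\mathcal G) := \Bigl\{ \sum_{j=1}^{d} \lambda_j h_j :
  \sum_{j=1}^{d} |\lambda_j| \le 1,\ h_j \in \mathcal G \Bigr\}.
\]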

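Using Lemma 3 in [21], we obtain that for all $\varepsilon \in E$

\[
N_{d_{Q,p}}\bigl(\mathcal F_{Q,p,N}, (2 + C)\varepsilon\bigr) \le
  \Bigl( \frac{e^2\, \mathrm{card}(\mathcal H_\varepsilon)
  \bigl(N(\varepsilon) + 4\varepsilon^{-2}\bigr)}{N^2(\varepsilon)} \Bigr)^{N(\varepsilon)},
\]

which immediately implies the bound. □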
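The key step of this proof, replacing each base function of a convex combination by a nearby net element without moving the combination by more than the worst individual displacement, can be checked numerically. The sketch below does this for a toy class under the empirical $L_2$ metric; the data, the two-element "net" and the helper names are all hypothetical and serve only to illustrate the triangle-inequality chain.

```python
import math


def emp_l2(f, g):
    """Empirical L2 distance between two functions given by sample values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / len(f))


def combine(weights, funcs):
    """Pointwise values of sum_j weights[j] * funcs[j]."""
    n = len(funcs[0])
    return [sum(w * h[i] for w, h in zip(weights, funcs)) for i in range(n)]


def round_to_net(funcs, net):
    """Replace every base function by its nearest element of the net."""
    return [min(net, key=lambda g: emp_l2(h, g)) for h in funcs]


# Hypothetical {-1, 1}-valued base functions on a sample of size 4.
weights = [0.5, 0.3, 0.2]
base = [[1, 1, -1, -1], [1, 1, -1, 1], [-1, 1, 1, 1]]
net = [[1, 1, -1, -1], [-1, 1, 1, 1]]

rounded = round_to_net(base, net)
lhs = emp_l2(combine(weights, base), combine(weights, rounded))
weighted = sum(abs(w) * emp_l2(h, g) for w, h, g in zip(weights, base, rounded))
worst = max(emp_l2(h, g) for h, g in zip(base, rounded))
```

Here `lhs`, `weighted` and `worst` correspond to the three successive terms in the displayed chain of inequalities, so one expects `lhs <= weighted <= worst`.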

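Lemma 3. Suppose that $\mathcal H$ satisfies (2.2). Then there exist constants
$K > 0$, $C > 0$ such that for all $t \ge K V(\mathcal H) \log n$, with
probability at least $1 - e^{-t}$ for all $f \in \mathcal F$ and all
$\varepsilon \ge \sqrt{t/n}$

\[
N_{d_{P_n,2}}(f, C\varepsilon) \le N_{d_{P,2}}(f, \varepsilon)
\]

and

\[
N_{d_{P,2}}(f, C\varepsilon) \le N_{d_{P_n,2}}(f, \varepsilon).
\]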

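Proof. Let

\[
H := \{ (h_1 - h_2)^2 : h_1, h_2 \in \mathcal H \}.
\]

Since $-1 \le h \le 1$ for $h \in \mathcal H$ one can write

\[
\bigl( (h_1 - h_2)^2 - (h'_1 - h'_2)^2 \bigr)^2
  \le 32\bigl( (h_1 - h'_1)^2 + (h_2 - h'_2)^2 \bigr).
\]

Hence the uniform covering numbers of $H$ can be estimated as

\[
\sup_{Q \in \mathcal P(\mathcal X)} N_{d_{Q,2}}(H, \varepsilon)
  \le \sup_{Q \in \mathcal P(\mathcal X)} N^2_{d_{Q,2}}(\mathcal H, \varepsilon/8)
  = O(\varepsilon^{-2V})
\]

using (2.3). Now, applying Theorem 7 and (3.5), we get that with probability
at least $1 - 2e^{-t}$, for all $h \in H$

\[
Ph - P_n h \le K\Bigl[ \Bigl( \frac{(Ph) V \log n}{n} \Bigr)^{1/2}
  + \Bigl( \frac{(Ph) t}{n} \Bigr)^{1/2} \Bigr]
\]

and

\[
P_n h - Ph \le K\Bigl[ \Bigl( \frac{(P_n h) V \log n}{n} \Bigr)^{1/2}
  + \Bigl( \frac{(P_n h) t}{n} \Bigr)^{1/2} \Bigr].
\]

For $t \ge K V \log n$ these inequalities imply

\[
Ph \le K\Bigl( P_n h + \frac{t}{n} \Bigr)
\quad\text{and}\quad
P_n h \le K\Bigl( Ph + \frac{t}{n} \Bigr).
\]

This yields that with probability $1 - 2e^{-t}$ for all
$h_1, h_2 \in \mathcal H$

\[
d_{P_n,2}(h_1, h_2) \le C\Bigl( d_{P,2}(h_1, h_2) \vee \sqrt{\frac{t}{n}} \Bigr)
\]

and

\[
d_{P,2}(h_1, h_2) \le C\Bigl( d_{P_n,2}(h_1, h_2) \vee \sqrt{\frac{t}{n}} \Bigr).
\]

Now, by the definition of $N_{d_{P,2}}(f, \varepsilon)$, there exists
$\mathcal H' \subset \mathcal H$ such that $f \in \mathrm{sconv}(\mathcal H')$
and $N_{d_{P,2}}(\mathcal H', \varepsilon) = N_{d_{P,2}}(f, \varepsilon)$. Hence,
with probability at least $1 - 2e^{-t}$, for any
$\varepsilon \ge \sqrt{t/n}$, we have

\[
N_{d_{P_n,2}}(f, C\varepsilon) \le N_{d_{P_n,2}}(\mathcal H', C\varepsilon)
  \le N_{d_{P,2}}(\mathcal H', \varepsilon) = N_{d_{P,2}}(f, \varepsilon),
\]

and similarly

\[
N_{d_{P,2}}(f, C\varepsilon) \le N_{d_{P_n,2}}(f, \varepsilon),
\]

which immediately implies the bound of the lemma (after a minor rescaling
and changing the constants). □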
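The elementary inequality with the constant 32 at the start of this proof, and the scale $\varepsilon/8$ in the covering-number bound that follows from it, can be verified directly. The short derivation below is added as a check; the abbreviations $a := h_1 - h_2$ and $b := h'_1 - h'_2$ are introduced here only for brevity and are not the paper's notation.

```latex
\begin{align*}
(a^2 - b^2)^2 &= (a - b)^2 (a + b)^2 \le 16\,(a - b)^2
  && \text{since } |a|, |b| \le 2, \\
(a - b)^2 &= \bigl( (h_1 - h'_1) - (h_2 - h'_2) \bigr)^2
  \le 2\bigl( (h_1 - h'_1)^2 + (h_2 - h'_2)^2 \bigr), \\
(a^2 - b^2)^2 &\le 32\bigl( (h_1 - h'_1)^2 + (h_2 - h'_2)^2 \bigr).
\end{align*}
% Integrating with respect to Q: if d_{Q,2}(h_i, h'_i) <= eps/8 for i = 1, 2, then
% d_{Q,2}(a^2, b^2) <= sqrt(32 * 2 * (eps/8)^2) = eps,
% so pairs of centers of (eps/8)-nets for the base class give an eps-net for H.
```

This accounts for the square of the covering number of $\mathcal H$ at scale $\varepsilon/8$ on the right-hand side of the covering bound for $H$.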



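Let us define a sequence

\[
\varepsilon_j := 2^j \sqrt{\frac{t}{n}} \quad \text{for } j \ge 0.
\]

Denote $m_n(t) := \min\{ j : \varepsilon_j \ge 1 \}$. Let $N$ be a nonnegative
nonincreasing function on $\mathbb R_+$ taking constant values on the
intervals $(0, \varepsilon_1)$, $[\varepsilon_j, \varepsilon_{j+1})$, $j \ge 1$.
Define

\begin{align*}
\mathcal F_{P_n,N} &:= \{ f \in \mathcal F : N_{d_{P_n,2}}(f, \varepsilon_j)
  \le N(\varepsilon_j),\ j = 0, \ldots, m_n(t) \}, \\
\mathcal F_{P,N} &:= \{ f \in \mathcal F : N_{d_{P,2}}(f, C\varepsilon_j)
  \le N(\varepsilon_j),\ j = 0, \ldots, m_n(t) \}, \\
\tilde{\mathcal F}_{P_n,N} &:= \{ f \in \mathcal F : N_{d_{P_n,2}}(f, C^2\varepsilon_j)
  \le N(\varepsilon_j),\ j = 0, \ldots, m_n(t) \}.
\end{align*}

Then it follows from Lemma 3 that: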



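Lemma 4. Under the conditions of Lemma 3, with probability at least
$1 - e^{-t}$,

\[
\mathcal F_{P_n,N} \subset \mathcal F_{P,N} \subset \tilde{\mathcal F}_{P_n,N}.
\]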



Let us introduce the function

\[
\psi(x) := \psi_N(x) := \int_0^x \sqrt{N(\varepsilon) \log\frac{1}{\varepsilon}}\; d\varepsilon.
\]
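To make the objects just introduced concrete, the following sketch builds the dyadic grid $\varepsilon_j = 2^j\sqrt{t/n}$ up to $m_n(t)$, evaluates a step function $N$ that is constant between grid points, and approximates $\psi_N(x)$ by a midpoint rule. It is a numerical illustration only: the function names, the representation of $N$ by a list of levels, and the use of $|\log u|$ in the integrand (which agrees with $\log(1/u)$ on $(0,1]$) are assumptions made here.

```python
import bisect
import math


def eps_grid(t, n):
    """Grid eps_j = 2**j * sqrt(t/n), j = 0, ..., m_n(t) = min{j : eps_j >= 1}."""
    eps = [math.sqrt(t / n)]
    while eps[-1] < 1.0:
        eps.append(2.0 * eps[-1])
    return eps


def step_value(e, grid, levels):
    """N(e) for a step function equal to levels[j] on [grid[j], grid[j+1]).

    Below grid[1] the value is levels[0]; at or above the last grid
    point it is levels[-1].
    """
    j = max(bisect.bisect_right(grid, e) - 1, 0)
    return levels[j]


def psi(x, grid, levels, steps=20000):
    """Midpoint-rule approximation of int_0^x sqrt(N(u) |log u|) du."""
    h = x / steps
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        total += math.sqrt(step_value(u, grid, levels) * abs(math.log(u)))
    return total * h


grid = eps_grid(t=1.0, n=16)   # sqrt(1/16) = 0.25, doubled until >= 1
m_n = len(grid) - 1
psi_flat = psi(1.0, grid, [1, 1, 1])
psi_step = psi(1.0, grid, [4, 2, 1])
```

For $N \equiv 1$ the integral over $(0,1]$ equals $\Gamma(3/2) = \sqrt{\pi}/2$, which gives a simple check of the quadrature.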



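Lemma 5. There exists $K > 0$ such that with probability at least
$1 - e^{-t}$ for all $f \in \mathcal F_{P,N}$

\[
P\{ yf(x) \le 0 \} \le K \inf_{\delta \in (0,1]} \Bigl( P_n\{ yf(x) \le \delta \}
  + \hat\varepsilon^N(\delta) + \frac{t + \log\log n}{n\delta^2} \Bigr).
\]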



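Proof. We apply Lemma 1 with $t$ replaced by $(2 + C^2)^2 t/\delta^2$ to the
class

\[
\mathcal G = \{ \varphi \circ f : f \in \mathcal F_{P,N} \} \cup \{0\},
\]

where $\varphi$ is the function equal to 1 for $u \le 0$, equal to 0 for
$u \ge \delta$ and linear in between and
$(\varphi \circ f)(x,y) := \varphi(yf(x))$. This gives the bound

\[
E_\varepsilon \sup_{g \in \mathcal G,\ P_n g \le r} |R_n(g)| \le K\Bigl[
  \frac{1}{\sqrt n} \int_{\frac{2 + C^2}{\delta}\sqrt{t/n}}^{(2r)^{1/2}}
    H^{1/2}_{d_{P_n,2}}(\mathcal G, u)\,du
  + \frac{(2 + C^2)\sqrt t}{\delta n}\,
    H^{1/2}_{d_{P_n,1}}\Bigl( \mathcal G, \frac{(2 + C^2)^2 t}{n\delta^2} \Bigr)
  + \frac{(2 + C^2)^2 t}{n\delta^2} \Bigr].
\]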



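Since the Lipschitz norm of $\varphi$ is $1/\delta$, we have

\[
d_{P_n,2}(\varphi \circ f, \varphi \circ g) \le \frac{1}{\delta}\, d_{P_n,2}(f,g)
\]

and

\[
d_{P_n,1}(\varphi \circ f, \varphi \circ g) \le \frac{1}{\delta}\, d_{P_n,1}(f,g).
\]

Therefore, we can upper bound the expression in the brackets by

\[
\frac{(2 + C^2)\sqrt t}{\delta n} \Bigl[ H^{1/2}_{d_{P_n,1}}\Bigl( \mathcal F_{P,N},
    \frac{(2 + C^2)^2 t}{n\delta} \Bigr) + 1 \Bigr]
  + \frac{1}{\delta\sqrt n} \int_{(2 + C^2)\sqrt{t/n}}^{\delta(2r)^{1/2}}
    \bigl[ H^{1/2}_{d_{P_n,2}}(\mathcal F_{P,N}, u) + 1 \bigr]\,du
  + \frac{(2 + C^2)^2 t}{n\delta^2}
\]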



[adding 1 to the square root of the entropy is due to the definition of
the class $\mathcal G$ which includes the function 0; we also use here the
inequality $\sqrt{\log(N + 1)} \le \sqrt{\log N} + 1$]. On the event
$\{ \mathcal F_{P,N} \subset \tilde{\mathcal F}_{P_n,N} \}$, which according
to Lemma 4 occurs with probability at least $1 - e^{-t}$, we can upper bound
the $L_2(P_n)$- and $L_1(P_n)$-entropies involved in the last expression by
the entropies of the class $\tilde{\mathcal F}_{P_n,N}$, which can be bounded
using Lemma 2. Namely, we have, for all $f \in \tilde{\mathcal F}_{P_n,N}$,

\[
N_{d_{P_n,2}}(f, C^2\varepsilon_j) \le N(\varepsilon_j), \quad j = 0, \ldots, m_n(t),
\]

which according to Lemma 2 implies that

\[
H_{d_{P_n,2}}\bigl( \tilde{\mathcal F}_{P_n,N}, (2 + C^2)\varepsilon_j \bigr)
  \le K N(\varepsilon_j) \log(1/\varepsilon_j).
\]

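Therefore, denoting $\tilde\varepsilon_j := (2 + C^2)\varepsilon_j$ and using
monotonicity of the entropy, we get

\begin{align*}
\int_{(2 + C^2)\sqrt{t/n}}^{\delta(2r)^{1/2}}
  H^{1/2}_{d_{P_n,2}}(\tilde{\mathcal F}_{P_n,N}, u)\,du
&\le \sum_{j:\, \tilde\varepsilon_j \le \delta(2r)^{1/2}}
  (\tilde\varepsilon_{j+1} - \tilde\varepsilon_j)\,
  H^{1/2}_{d_{P_n,2}}(\tilde{\mathcal F}_{P_n,N}, \tilde\varepsilon_j) \\
&\le K \sum_{j:\, (2 + C^2)\varepsilon_j \le \delta(2r)^{1/2}}
  (2 + C^2)(\varepsilon_{j+1} - \varepsilon_j)
  \sqrt{N(\varepsilon_j) \log(1/\varepsilon_j)} \\
&\le K \int_0^{2\sqrt2\,\delta\sqrt r} \sqrt{N(u)\,|\log u|}\;du.
\end{align*}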

Note also that since the class $\mathcal H$ consists of functions taking
values in $\{-1, 1\}$, for any probability measure $Q$ we have
$d^2_{Q,2}(h_1, h_2) = 2 d_{Q,1}(h_1, h_2)$, which implies that
$N_{d_{Q,2}}(f, \sqrt{2\varepsilon}) = N_{d_{Q,1}}(f, \varepsilon)$. Thus,

\[
\forall f \in \tilde{\mathcal F}_{P_n,N}\ \
  N_{d_{P_n,2}}(f, C^2\varepsilon_0) \le N(\varepsilon_0)
\ \Longrightarrow\
\forall f \in \tilde{\mathcal F}_{P_n,N}\ \
  N_{d_{P_n,1}}(f, C^4\varepsilon_0^2/2) \le N(\varepsilon_0).
\]

Since $\varepsilon_0 = \sqrt{t/n}$, this, in view of Lemma 2, yields the bound



H^^M^CV^- j <KN^-jlo g] /- 
Collecting the above bounds gives on the event {.Fp,jv C fip n ,N} 



2 + C 2 ft 



(2 + c 2 yt 

n8 



+ 1 



1 f S ( 2r ) 1/2 1/2 



< 



K 



\ 



N 



log 



+ 



2 V / 2<5\A : 



^Jn(u) \ log-uj du 



which, using the fact that the function
$x \mapsto \int_0^x \sqrt{N(u)\,|\log u|}\,du$ is concave, can be bounded by
$K\phi_n(\sqrt r)$, where

\[
\phi_n(\sqrt r) := \phi_{n,\delta}(\sqrt r) := \frac{1}{\delta\sqrt n}
  \int_0^{\delta\sqrt r} \sqrt{N(u)\,|\log u|}\;du.
\]



Thus, with probability at least $1 - e^{-t}$,

\[
E_\varepsilon \sup_{g \in \mathcal G,\ P_n g \le r} |R_n(g)|
  \le K\Bigl[ \phi_n(\sqrt r) + \frac{t}{n\delta^2} \Bigr]
\]

and Theorem 8 implies that also with probability at least $1 - e^{-t}$ for
all $g \in \mathcal G$

\[
Pg \le K\Bigl[ P_n g + \hat r_n + \frac{t + \log\log n}{n\delta^2} \Bigr],
\]

where $\hat r_n$ is the largest solution of the equation
$\phi_n(\sqrt r) = r$, which in our case is equal to
$\hat\varepsilon^N(\delta)$. Therefore, for a fixed $\delta \in (0,1]$ with
probability at least $1 - e^{-t}$ for all $f \in \mathcal F_{P,N}$

\begin{align*}
P\{ yf(x) \le 0 \} \le P(\varphi \circ f)
&\le K\Bigl[ P_n(\varphi \circ f) + \hat\varepsilon^N(\delta)
  + \frac{t + \log\log n}{n\delta^2} \Bigr] \\
&\le K\Bigl( P_n\{ yf(x) \le \delta \} + \hat\varepsilon^N(\delta)
  + \frac{t + \log\log n}{n\delta^2} \Bigr).
\end{align*}

It remains to make the bound uniform in $\delta \in (0,1]$ by applying it with
$\delta = \delta_j = 2^{-j}$ and $t$ replaced by $t + 2\log(j + 1)$, using the
union bound along with the monotonicity of the expressions involved with
respect to $\delta$, and properly adjusting the value of the constant $K$. □
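The last step of the proof is only sketched in the text. The following worked computation, added here as an illustration, records why the choice $t \mapsto t + 2\log(j+1)$ keeps the total failure probability of order $e^{-t}$ and why passing from the grid $\delta_j = 2^{-j}$ to an arbitrary $\delta$ costs only a constant factor.

```latex
\begin{align*}
\sum_{j \ge 0} \exp\{ -(t + 2\log(j+1)) \}
  &= e^{-t} \sum_{j \ge 0} \frac{1}{(j+1)^2}
   = \frac{\pi^2}{6}\, e^{-t} \le 2 e^{-t}, \\
\delta \in (\delta_{j+1}, \delta_j] \ \Longrightarrow\
  P_n\{ yf(x) \le \delta_{j+1} \} &\le P_n\{ yf(x) \le \delta \}, \qquad
  \frac{1}{\delta_{j+1}^2} = \frac{4}{\delta_j^2} \le \frac{4}{\delta^2}.
\end{align*}
```

The factor $2$ in the probability and the factor $4$ in the last term are the kind of quantities absorbed by adjusting $K$; the extra $2\log(j+1)$ in the numerator matters only for $\delta$ so small that the bound is already trivial.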

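Proof of Theorem 5. We will prove, in fact, an improved version of
the result (see the remark after the statement). To simplify the notation, we
remove the term $\varepsilon^{-2V/(2+V)}$ from the definition of
$H_n(f, \varepsilon)$ and the follow-up definition of $\psi_n(f, t, \delta)$;
this omission, however, does not change anything in the proof. By the
condition on the class $\mathcal H$,

\[
\sup_{Q \in \mathcal P(\mathcal X)} N_{d_{Q,2}}(\mathcal H, \varepsilon)
  = O(\varepsilon^{-V}), \quad \varepsilon > 0.
\]

Clearly, we have

\[
N_{d_{P_n,2}}(f, \varepsilon) \le \sup_{Q \in \mathcal P(\mathcal X)}
  N_{d_{Q,2}}(\mathcal H, \varepsilon), \quad \varepsilon > 0.
\]

As before, $\varepsilon_j = 2^j\sqrt{t/n}$ and let
$J := \{ j \ge 0 : \varepsilon_j \le 2 \}$. Denote by $\mathcal N$ the set of
nonincreasing step functions on $\mathbb R_+$ with jumps only at the points
$\varepsilon_j$, $j \ge 0$, and such that

\[
N(\varepsilon_j) \le K\varepsilon_j^{-V}, \quad j \in J.
\]

Assume also that, for $N \in \mathcal N$ and $\varepsilon \le \varepsilon_0$,
$N(\varepsilon) = N(\varepsilon_0)$. Then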

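\begin{align*}
&\Pr\Bigl\{ \exists f \in \mathcal F\ \exists \delta \in (0,1] :\
  P\{ yf(x) \le 0 \} > K\Bigl( P_n\{ yf(x) \le \delta \}
  + \hat\varepsilon_n(f, t, \delta) + \frac{t + \log\log n}{n\delta^2} \Bigr) \Bigr\} \\
&\quad \le E \sum_{N \in \mathcal N}
  I\bigl( N_{d_{P_n,2}}(f, \varepsilon_j) = N(\varepsilon_j),\ j \in J \bigr) \\
&\qquad \times I\Bigl( \exists f \in \mathcal F_{P_n,N}\ \exists \delta \in (0,1] :\
  P\{ yf(x) \le 0 \} > K\Bigl( P_n\{ yf(x) \le \delta \}
  + \hat\varepsilon^N(\delta) + \frac{t + \log\log n}{n\delta^2} \Bigr) \Bigr) =: B,
\end{align*}

where we used the facts that, on the event
$\{ N_{d_{P_n,2}}(f, \varepsilon_j) = N(\varepsilon_j),\ j \in J \}$,

\[
f \in \mathcal F \ \Longrightarrow\ f \in \mathcal F_{P_n,N}
\]

and also we have on the same event
$\psi_n(f, t, u) \le \psi_N(u)$, $u > 0$, which yields

\[
\hat\varepsilon_n(f, t, \delta) \le \hat\varepsilon^N(\delta).
\]

According to Lemma 4, for all $N \in \mathcal N$,
$\mathcal F_{P_n,N} \subset \mathcal F_{P,N}$ with probability at
least $1 - e^{-t}$. Also, by simple combinatorics,

\[
\mathrm{card}(\mathcal N) \le \prod_{j \in J} K\Bigl( \frac{1}{\varepsilon_j} \Bigr)^{V}.
\]

Therefore, we can use Lemma 5 and further bound $B$ by

\begin{align*}
&\sum_{N \in \mathcal N} E\, I\bigl( N_{d_{P_n,2}}(f, \varepsilon_j) = N(\varepsilon_j),\ j \in J \bigr) \\
&\qquad \times I\Bigl( \exists f \in \mathcal F_{P,N}\ \exists \delta \in (0,1] :\
  P\{ yf(x) \le 0 \} > K\Bigl( P_n\{ yf(x) \le \delta \}
  + \hat\varepsilon^N(\delta) + \frac{t + \log\log n}{n\delta^2} \Bigr) \Bigr)
\end{align*}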



^ IT K (-) V \ SU P Pr ( 3 / G ^.Ar 3 6 G (0, 1] : 

Hvffr) < °} 



> A" 



V n {yf(x) < 5} 



t + log log n 



n 



S 2 



<2exp|-t + ^^yiog^- + logA^| 



+ e 



-/ 



n 



< 2 exp | -t + C log z - + log 2 1 , 

which implies the bound of the theorem (subject to adjusting the constants). 
□ 



4. Concluding remarks. We have developed several new complexity mea- 
sures of functions from the convex hull of a given base class and proved 
adaptive margin type bounds on the generalization error of ensemble classi- 
fiers in terms of these complexities. The complexities are based on measuring 
sparsity of the weights of a convex combination and clustering of the base 
functions involved in it. Hopefully, they can provide some insights to the de- 
velopers of classification algorithms about the relative importance of various 
parameters influencing the performance of classifiers. It might be possible 
to combine several types of bounds discussed in the paper into a bound that 
takes into account different complexity characteristics, but our goal here is 
not to develop "the Mother of All Bounds," but rather to explore several 
possible approaches to the problem.

The results of the paper suggest that it might be of interest to study
experimentally the statistical properties of base classifiers in ensembles out- 
put by classification algorithms (in particular, their clustering properties) in 
connection with generalization ability of the algorithms. (Some preliminary 
results in this direction for AdaBoost and other classification algorithms with 
real and simulated data can be found in [20] and more results are in [1].) 
Another interesting line of research might be related to proving that boost- 
ing type algorithms do output combined classifiers with a certain degree of 
clustering of base classifiers in the ensemble and a certain degree of sparsity 
of their weights. (The results of [30] show that the sparsity of the coefficients 
indeed takes place in the case of support vector machines.) 

Our main goal has been to develop margin-type bounds on generalization 
error in terms of sparsity and clustering, but the complexities we introduced 
might be of interest in some other problems, for instance, in studying conver- 
gence rates of classification algorithms to the Bayes risk. Recent results on 
consistency [15, 22, 33, 34] and convergence rates [6, 7] of boosting-type al- 
gorithms suggest that some regularization of the algorithms (either by early 
stopping, or by penalization) might be needed in order to achieve reasonable 
convergence rates. However, the precise form of this regularization is still an 
open question and it depends crucially on which complexity measures are 
used to take into account the sparsity and the clustering properties of the 
algorithms. Some of the complexities discussed in the paper might be used 
as penalties, especially, the complexities based on the notion of variance of a 
convex combination (this is also computationally attractive). Another area 
where these complexities might be very useful is the problem of optimal 
aggregation of estimators in regression or classification (see [3, 31]). 
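The remark that variance-based complexities are computationally attractive can be illustrated with a few lines of code. The sketch below assumes that the variance of a convex combination $f = \sum_j \lambda_j h_j$ at a point $x$ is $\sum_j \lambda_j (h_j(x) - f(x))^2$ and that the relevant empirical quantity is the fraction of sample points where this variance is at least a threshold $\gamma$; the definition is not restated in this section, so both the formula and the function names should be read as assumptions.

```python
def combination_variance(weights, funcs):
    """Pointwise variance sum_j w_j (h_j(x_i) - f(x_i))^2 of a convex combination.

    weights: nonnegative weights summing to one (assumed, not checked).
    funcs:   funcs[j][i] = h_j(x_i), the base functions on the sample.
    """
    n = len(funcs[0])
    out = []
    for i in range(n):
        f_i = sum(w * h[i] for w, h in zip(weights, funcs))
        out.append(sum(w * (h[i] - f_i) ** 2 for w, h in zip(weights, funcs)))
    return out


def fraction_at_least(values, gamma):
    """Empirical frequency of sample points with value >= gamma."""
    return sum(v >= gamma for v in values) / len(values)


# Hypothetical example: two {-1, 1}-valued base classifiers on two points.
variances = combination_variance([0.5, 0.5], [[1, 1], [1, -1]])
freq = fraction_at_least(variances, gamma=0.5)
```

The cost is linear in the number of base functions times the sample size, and for $\{-1,1\}$-valued base classifiers the variance reduces to $1 - f(x)^2$, so it is small exactly where the ensemble votes nearly unanimously.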

It should be emphasized that the complexities of convex combinations we
have introduced are by no means the only possible ones. They are, however,
quite typical, and they capture features of functions in the convex hull
that are of importance in classification.